Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve countlines() performance #11947

Merged
merged 2 commits into from
Jun 30, 2015
Merged

Improve countlines() performance #11947

merged 2 commits into from
Jun 30, 2015

Conversation

quinnj
Copy link
Member

@quinnj quinnj commented Jun 30, 2015

Size of File Base.countlines new countlines
10MB 11.3 milliseconds 7.3 milliseconds
100MB 102 milliseconds 62.5 milliseconds
1GB 928 milliseconds 598 milliseconds
21GB 31 seconds 26 seconds

The first 3 files were auto-generated from random characters + newlines. The 21GB file is from @jiahao's test file from the readdlm issue.

Also added previously missing tests

@jiahao
Copy link
Member

jiahao commented Jun 30, 2015

I restarted two builds - one had a weird type inference error and the other had a network failure.

@quinnj
Copy link
Member Author

quinnj commented Jun 30, 2015

Yeah, the failures seemed unrealted. Let's wait on AppVeyor then merge. This isn't anything spectacular, but close to 2x improvements in some cases is nice.

@jiahao
Copy link
Member

jiahao commented Jun 30, 2015

Glad to see improvements on the data ingest end of the standard library!

@carnaval
Copy link
Contributor

Is it computing the same thing ? The old impl. seems to not be counting empty lines.

@quinnj
Copy link
Member Author

quinnj commented Jun 30, 2015

Yeah, I noticed that. I'm not sure why we wouldn't count empty lines? Looking at the source code for wc it didn't have anything like that.

@quinnj
Copy link
Member Author

quinnj commented Jun 30, 2015

We could maybe provide a skip_emtpy argument, but I'd really like to know if anyone had a use for this.

@ScottPJones
Copy link
Contributor

👍 LGTM

@mbauman
Copy link
Member

mbauman commented Jun 30, 2015

Well, it was introduced in 2eb2493 with a reference to dlmread. I didn't dig any further, but is it (or was it) used to determine the size of an array to preallocate?

@ScottPJones
Copy link
Contributor

OK, I've looked everywhere. This is documented as counting non-empty lines, however, it is not used anywhere in Base, and is only used in one place in any registered package, in Jumos/src/Trajectories/XYZ/Reader.jl, where it is just verifying that the number of lines stored in the header of the file matches the rest of the lines in the file, not performance critical.

@quinnj
Copy link
Member Author

quinnj commented Jun 30, 2015

Thanks for looking around @ScottPJones. My guess is it used to be used by readdlm, but now it's using something else. Sounds like the breaking nature here won't affect much, so I'm inclined to merge and we can add another argument later if a compelling use case comes up. I'll update the docs.

@ScottPJones
Copy link
Contributor

Yes, I'm sad though, because your code looks very nice and fast, only to find out that it's not (currently) used except that once in a package.
Better to merge it in, have the faster,better code for anybody in the future who needs it!

@IainNZ IainNZ added the performance Must go faster label Jun 30, 2015
quinnj added a commit that referenced this pull request Jun 30, 2015
Improve countlines() performance
@quinnj quinnj merged commit e2b78df into master Jun 30, 2015
@quinnj quinnj deleted the jq/countlines branch June 30, 2015 21:26
@quinnj
Copy link
Member Author

quinnj commented Jun 30, 2015

Ref: Luthaf/Jumos.jl#22

if !isascii(eol)
throw(ArgumentError("only ASCII line terminators are supported"))
end
countlines(f::AbstractString,eol::Char='\n') = open(io->countlines(io,eol),f)::Int
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting tidbit for others out there: while the open(func, file) can be quick and tidy, it's also quite subtle from a type stability perspective. Without the ::Int assertion here, @code_warntype shows that the return type for this method is Any, even though countlines itself is type-stable.

nolta pushed a commit that referenced this pull request Sep 19, 2015
nolta pushed a commit that referenced this pull request Sep 19, 2015
Missing from #11947
Closes #12844

(cherry picked from commit 425fa46)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants