Adds from, read, write, to_s methods #52
Conversation
The to_format method is pending, and is quite ambiguous: it isn't clear which format instance it should return. Also, the YARD docs are yet to be updated here, since they were reworked for markdown support in PR #50; it would be better to proceed after that PR is merged.
Force-pushed Minor arguments issue for remote files parsing (2316fad → b376af1)
On it.
Well, I have a HUGE concern now that I see the changes. Let's discuss it?
README.md (Outdated)
- df = Daru::IO::Importers::Format.new(args).call
+ instance = Daru::IO::Importers::Format.new(opts)
+ df1 = instance.from(connection)
+ df2 = instance.read(path)
Doesn't look informative to me :(
README.md (Outdated)
@@ -130,8 +133,8 @@ Imports a **Daru::DataFrame** from a **.csv** or **.csv.gz** file.
  require 'daru/io/importers/csv'

  #! Usage from Daru::IO
- df1 = Daru::IO::Importers::CSV.new('path/to/file.csv', skiprows: 10, col_sep: ' ').call
- df2 = Daru::IO::Importers::CSV.new('path/to/file.csv.gz', skiprows: 10, compression: :gzip).call
+ df1 = Daru::IO::Importers::CSV.new(skiprows: 10, col_sep: ' ').read('path/to/file.csv')
Funny thing here (which we'll probably postpone, considering how little time we have) is that it looks absolutely, totally, astonishingly unnatural.
If we think of an importer as an object, usable on its own, then in my head the following is logical:
importer = Daru::IO::Importers::CSV.read('mydata.csv', col_sep: ' ') # reads once from file, converts data to intermediate representation
# or
importer = Daru::IO::Importers::CSV.parse(some_string_from_other_source, col_sep: ' ')
df1 = importer.call(skiprows: 10, count: 20) # first 10 lines to one DF, quickly
df2 = importer.call(skiprows: 20) # the rest to another one, quickly
This way we know why we created the object and what we've passed to the constructor.
But what is the object created with Daru::IO::Importers::CSV.new(skiprows: 10, col_sep: ' ') - what's its "physical meaning" or "semantics"?.. ¯\_(ツ)_/¯
Does it make sense to you?..
The main idea I had behind Importers.new(opts).read(path) was that it would facilitate automation when the Importer is used to collect similar data from LOTS of sources - for example, a folder containing lots of CSV files, with maybe some daily trends in each file. That way, the opts (options) given at initialization can be re-used for multiple sources, rather than being passed on each call as in Importers::CSV.read(path).call(opts). This is what I had in mind while favoring a semantic like Importers::CSV.new(opts).read(path):
csv_files = Dir.glob('contains/1000/csvs/*.csv') # full paths to all .csv files
csv_importer = Daru::IO::Importers::CSV.new(skiprows: 10, col_sep: ' ')
csv_files.map do |filename|
  csv_importer.read(filename) # one DataFrame per file, same options
end.reduce(:concat)
README.md
Outdated
@@ -195,8 +198,8 @@ Imports an **Array** of **Daru::DataFrame**s from a **.html** file or website.
  require 'daru/io/importers/html'

  #! Usage from Daru::IO
- df1 = Daru::IO::Importers::HTML.new('https://some/url/with/tables', match: 'market', name: 'Shares analysis').call
- df2 = Daru::IO::Importers::HTML.new('path/to/file.html', match: 'market', name: 'Shares analysis').call
+ df1 = Daru::IO::Importers::HTML.new(match: 'market', name: 'Shares analysis').read('https://some/url/with/tables')
What I've said above relates to all importers, but here it is extremely visible:
importer = Daru::IO::Importers::HTML.new('https://some/url/with/20Mb/of/tables') # URL read once, Nokogiri document parsed once
df1 = importer.call(match: 'prices')
df2 = importer.call(match: 'goods')
df3 = importer.call(match: 'clients')
Do you feel it?..
Ah, I can see this. But I feel that the existing setup also makes sense when gathering similar data from multiple sources - something like importing various repositories' data from the GitHub API:
json_importer = Daru::IO::Importers::JSON.new(RepositoryName: '$..full_name', Stars: '$..stargazers_count', Size: '$..size', Forks: '$..forks_count')
%w[athityakumar zverok v0dro].map do |username|
json_importer.read("https://api.github.com/users/#{username}/repos")
end.reduce(:concat)
But of course, your suggestion does better when the data source is pretty huge - we definitely don't want to read it again & again. I can change that, and it can be used like this even for the example I gave:
json_options = {RepositoryName: '$..full_name', Stars: '$..stargazers_count', Size: '$..size', Forks: '$..forks_count'}
%w[athityakumar zverok v0dro].map do |username|
Daru::IO::Importers::JSON.read("https://api.github.com/users/#{username}/repos").call(json_options)
end.reduce(:concat)
No issues, refactoring this shouldn't take much time (hopefully). 😅
@zverok - I think I've explained what I had in mind while favoring the approach of Importers::Format.new(opts).read(file) over Importers::Format.read(file).call(opts). Which way should we proceed?
Yep, your last code example is the "right" way (as for me). Creating an importer from just the JSONPaths provides no value (nothing "heavy" is done at creation, and re-parsing the JSONPaths on each call is cheap).
@zverok - Acknowledged. I'll update this PR by tomorrow. 👍
(That's why it is sometimes better to discuss code than just ideas in GitHub issues... We could have gone faster if it weren't so hard for me to imagine everything beforehand. Sorry for that.)
@zverok - Added the changes. Please review again. 😄
instance = Daru::IO::Exporters::Format.new(df, opts)
instance.to_s        #=> Provides a file-writable string, which can be used in web applications for file download purposes
instance.to          #=> Provides a Format instance
instance.write(path) #=> Writes to the given path
Exporters API, on the other hand, looks reasonable.
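(As an aside, that to_s string is exactly what a web handler needs. A minimal sketch, assuming a CSV exporter and a hypothetical Rails action - send_data is standard Rails, and @data_frame is an assumed instance variable:)
# Hypothetical Rails controller action built on the Exporters API above
def download
  exporter = Daru::IO::Exporters::CSV.new(@data_frame)
  # to_s returns the file-writable string; stream it as an attachment
  send_data exporter.to_s, filename: 'report.csv', type: 'text/csv'
end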
…ll(opts) Rubocop, YARD Docs & README are still a mess. But nevertheless, there's a great deal of code porting that has been done, and I NEVER want to have any of these changes accidentally reverted. RSpec is passing. 🎉
#
# @example Reading from csv.gz file
#   instance = Daru::IO::Importers::CSV.read("matrix_test.csv.gz")
def read(path)
I'm having a dilemma about YARD-documenting the read (and from) methods. Basically, there are 2 read methods - the Importers::Format#read instance method and the Importers::Base.read class method. So, I'm showcasing an example of Daru::IO::Importers::CSV.read(path) usage, though technically it's Daru::IO::Importers::CSV.new.read(path) that gets called by Daru::IO::Importers.read(path). So, though it's technically incorrect, I'd like it to stay this way, as users would find the method's showcase easier to follow.
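For context, a minimal sketch of the delegation being described (hedged: module layout simplified for self-containedness; the actual daru-io code differs):
module Daru
  module IO
    module Importers
      class Base
        # Class-level convenience: Importers::CSV.read(path) builds a
        # default instance and delegates to the instance-level #read.
        def self.read(path)
          new.read(path)
        end
      end

      class CSV < Base
        # Instance-level read: parses the file at +path+ and returns
        # self, so that #call can be invoked on it afterwards.
        def read(path)
          # ... parse the file into an intermediate representation ...
          self
        end
      end
    end
  end
end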
You can imitate docs for Daru::IO::Importers::CSV.read (as if it exists and is defined in this file) with YARD's @!method directive. I believe it would be an ideal solution from all possible points of view, WDYT?
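Something along these lines might work (a sketch only, not tested against daru-io's YARD setup; the @param/@return details are illustrative):
# @!method self.read(path)
#   Imitated class-level read; actually dispatched through Importers::Base.
#   @param path [String] path to the .csv or .csv.gz file
#   @return [Daru::IO::Importers::CSV] an importer wrapping the parsed file
#   @example Reading from a csv file
#     instance = Daru::IO::Importers::CSV.read('matrix_test.csv')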
Is it ready for the new round of review?
@zverok - Apart from the YARD doc issue discussed above, yes.
OK, on it.
- df1 = Daru::IO::Importers::CSV.new('path/to/file.csv', skiprows: 10, col_sep: ' ').call
- df2 = Daru::IO::Importers::CSV.new('path/to/file.csv.gz', skiprows: 10, compression: :gzip).call
+ df1 = Daru::IO::Importers::CSV.read('path/to/file.csv').call(skiprows: 10, col_sep: ' ')
+ df2 = Daru::IO::Importers::CSV.read('path/to/file.csv.gz').call(skiprows: 10, compression: :gzip)
Note to future (maybe create a GitHub issue?): in fact, options should be split into (file-)reading ones and DF-importing ones, like this: Daru::IO::Importers::CSV.read('path/to/file.csv', skiprows: 10, col_sep: ' ').call(order: [....]) -- only this way can file reading be performed once on .read, and the conversion then several times on .call. But it would be a bit tricky, so not in this PR. (See the sketch below.)
Opened #53 for this. 👍