
Adds from, read, write, to_s methods #52

Merged 15 commits on Aug 31, 2017

Conversation

athityakumar (Member)

The to_format method is pending, and is quite ambiguous as to which format instance it would make more sense to return. Also, the YARD docs are yet to be updated here, since they are being reworked for markdown support in PR #50, and it would be better to proceed after that PR is merged.
Minor arguments issue for remote files parsing
athityakumar changed the title from [WIP] Adds from, read, write, to_s methods to Adds from, read, write, to_s methods on Aug 29, 2017
athityakumar (Member Author)

@zverok @v0dro - Please review when you're free. I think we'll be good to go with v0.1 release after merging this PR. 😄

zverok (Collaborator) commented Aug 29, 2017

On it.

zverok (Collaborator) left a comment

Well, I have a HUGE concern now that I see the changes. Let's discuss it?

README.md Outdated
df = Daru::IO::Importers::Format.new(args).call
instance = Daru::IO::Importers::Format.new(opts)
df1 = instance.from(connection)
df2 = instance.read(path)
zverok (Collaborator)

Doesn't look informative to me :(

README.md Outdated
@@ -130,8 +133,8 @@ Imports a **Daru::DataFrame** from a **.csv** or **.csv.gz** file.
require 'daru/io/importers/csv'

#! Usage from Daru::IO
df1 = Daru::IO::Importers::CSV.new('path/to/file.csv', skiprows: 10, col_sep: ' ').call
df2 = Daru::IO::Importers::CSV.new('path/to/file.csv.gz', skiprows: 10, compression: :gzip).call
df1 = Daru::IO::Importers::CSV.new(skiprows: 10, col_sep: ' ').read('path/to/file.csv')
zverok (Collaborator)

Funny thing here (which we'll probably postpone, considering how little time we have) is that it looks absolutely, totally, astonishingly unnatural.

If we think about the importer as an object, usable on its own, in my head the following is logical:

importer = Daru::IO::Importers::CSV.read('mydata.csv', col_sep: ' ') # reads once from file, converts data to intermediate representation
# or
importer = Daru::IO::Importers::CSV.parse(some_string_from_other_source, col_sep: ' ')
df1 = importer.call(skiprows: 10, count: 20) # first 10 lines to one DF, quickly
df2 = importer.call(skiprows: 20) # the rest to another one, quickly

This way we know why we created the object and what we've passed to the constructor.
But what is the object created with Daru::IO::Importers::CSV.new(skiprows: 10, col_sep: ' ')? What's its "physical meaning" or "semantics"?.. ¯\_(ツ)_/¯

Does it make sense for you?..

athityakumar (Member Author)

The main idea I had behind Importers.new(opts).read(path) was to facilitate automation when the Importer is used to collect similar data from LOTS of sources. For example, a folder containing lots of CSV files, with daily trends in each file.

That way, the opts (options) given to the initializer can be re-used across multiple sources, instead of being passed on each call as in Importers::CSV.read(path).call(opts). This is what I had in mind while choosing a semantic like Importers::CSV.new(opts).read(path).

csv_files    = Dir.entries('contains/1000/csvs').keep_if { |file| file.end_with?('.csv') }
csv_importer = Daru::IO::Importers::CSV.new(skiprows: 10, col_sep: ' ')

# map (not each) collects the imported DataFrames before concatenation
csv_files.map do |filename|
  csv_importer.read(filename)
end.reduce(:concat)

README.md Outdated
@@ -195,8 +198,8 @@ Imports an **Array** of **Daru::DataFrame**s from a **.html** file or website.
require 'daru/io/importers/html'

#! Usage from Daru::IO
df1 = Daru::IO::Importers::HTML.new('https://some/url/with/tables', match: 'market', name: 'Shares analysis').call
df2 = Daru::IO::Importers::HTML.new('path/to/file.html', match: 'market', name: 'Shares analysis').call
df1 = Daru::IO::Importers::HTML.new(match: 'market', name: 'Shares analysis').read('https://some/url/with/tables')
zverok (Collaborator)

What I've said above relates to all importers, but here it is extremely visible:

importer = Daru::IO::Importers::HTML.new('https://some/url/with/20Mb/of/tables') # URL read once, Nokogiri document parsed once
df1 = importer.call(match: 'prices')
df2 = importer.call(match: 'goods')
df3 = importer.call(match: 'clients')

Do you feel it?..

athityakumar (Member Author)

Ah, I can see this. But I feel that the existing setup also makes sense when gathering similar data from multiple sources. Something like importing repositories of various users from the GitHub API:

json_importer = Daru::IO::Importers::JSON.new(RepositoryName: '$..full_name', Stars: '$..stargazers_count', Size: '$..size', Forks: '$..forks_count')

%w[athityakumar zverok v0dro].map do |username|
    json_importer.read("https://api.github.com/users/#{username}/repos")
end.reduce(:concat)

But of course, your suggestion does better when the data source is pretty huge. We definitely don't want to read it again and again. I can change that, and maybe it can even be used like this for the example I gave:

json_options = {RepositoryName: '$..full_name', Stars: '$..stargazers_count', Size: '$..size', Forks: '$..forks_count'}

%w[athityakumar zverok v0dro].map do |username|
    Daru::IO::Importers::JSON.read("https://api.github.com/users/#{username}/repos").call(json_options)
end.reduce(:concat)

No issues, refactoring this shouldn't take much time (hopefully). 😅

athityakumar (Member Author)

@zverok - I think I've explained what I had in mind while favoring the approach of Importers::Format.new(opts).read(file) over Importers::Format.read(file).call(opts). Which way should we proceed?

zverok (Collaborator)

Yep, your last code example is the "right" way (as for me). Creating an importer just from JSONPaths provides no value (nothing "heavy" is done on create, and reparsing the JSONPaths on each call is cheap).

athityakumar (Member Author)

@zverok - Acknowledged. I'll update this PR by tomorrow. 👍

zverok (Collaborator)

(That's why it is sometimes better to discuss code than just ideas in GitHub issues... We could have gone faster if it weren't so hard for me to imagine everything beforehand. Sorry for that.)

athityakumar (Member Author)

@zverok - Added the changes. Please review again. 😄

instance = Daru::IO::Exporters::Format.new(df, opts)
instance.to_s #=> Provides a file-writable string, which can be used in web applications for file download purposes
instance.to #=> Provides a Format instance
instance.write(path) #=> Writes to the given path
zverok (Collaborator)

Exporters API, on the other hand, looks reasonable.
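
For illustration, a hypothetical round-trip with a concrete exporter (the Daru::IO::Exporters::CSV class and the option-free construction here are assumptions, not taken from this PR):

df = Daru::DataFrame.new(a: [1, 2], b: [3, 4])
exporter = Daru::IO::Exporters::CSV.new(df)  # hypothetical concrete exporter

csv_string = exporter.to_s   # file-writable string, e.g. for downloads
exporter.write('output.csv') # writes the CSV file to the given path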

…ll(opts)

Rubocop, YARD Docs & README are still a mess. But nevertheless, there's a great deal of code porting that has been done, and I NEVER want to have any of these changes accidentally reverted. RSpec is passing. 🎉
#
# @example Reading from csv.gz file
#   instance = Daru::IO::Importers::CSV.read("matrix_test.csv.gz")
def read(path)
athityakumar (Member Author)

I'm having a dilemma about YARD-documenting the read (and from) methods. Basically, there are 2 read methods - the Importers::Format#read instance method and the Importers::Base.read class method.

So, I'm showcasing an example of Daru::IO::Importers::CSV.read(path) usage, though it's technically Daru::IO::Importers::CSV.new.read(path) that gets called by the class-level read. So, though it's technically incorrect, I'd like it to stay this way, as users would find the method's showcase easier to follow.
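
For reference, the delegation being described - a minimal hypothetical sketch, not the exact daru-io source:

module Daru
  module IO
    module Importers
      class Base
        # The class-level read builds a bare instance and forwards to
        # the instance-level #read defined by each Format subclass.
        def self.read(path)
          new.read(path)
        end
      end
    end
  end
end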

zverok (Collaborator)

You can imitate docs for Daru::IO::Importers::CSV.read (as if it exists and is defined in this file) with YARD's @!method directive. I believe it would be the ideal solution from all possible points of view, WDYT?
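
For instance, something along these lines in the CSV importer file (a hypothetical sketch; the description and tags are illustrative):

# @!method self.read(path)
#   Reads data from the given path into a primed importer instance
#   @param path [String] path to the csv / csv.gz file
#   @return [Daru::IO::Importers::CSV]
class CSV < Base
end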

zverok (Collaborator) commented Aug 31, 2017

Is it ready for a new round of review?

athityakumar (Member Author)

@zverok - Apart from the YARD doc issue discussed above, yes.

zverok (Collaborator) commented Aug 31, 2017

OK, on it.

zverok (Collaborator) left a comment

:shipit:

df1 = Daru::IO::Importers::CSV.new('path/to/file.csv', skiprows: 10, col_sep: ' ').call
df2 = Daru::IO::Importers::CSV.new('path/to/file.csv.gz', skiprows: 10, compression: :gzip).call
df1 = Daru::IO::Importers::CSV.read('path/to/file.csv').call(skiprows: 10, col_sep: ' ')
df2 = Daru::IO::Importers::CSV.read('path/to/file.csv.gz').call(skiprows: 10, compression: :gzip)
zverok (Collaborator)

Note to future (maybe create a GitHub issue?): in fact, options should be split into (file-)reading ones and DF-importing ones, like this: Daru::IO::Importers::CSV.read('path/to/file.csv', skiprows: 10, col_sep: ' ').call(order: [....]) -- only this way can the file reading be performed once on .read, and the conversion then run several times via call. But it would be a bit tricky, so not in this PR.
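
A sketch of that split, mirroring the example above (the order: column names are illustrative):

importer = Daru::IO::Importers::CSV.read('path/to/file.csv', skiprows: 10, col_sep: ' ')  # file read & parsed once

df1 = importer.call(order: %i[date price])   # conversion only - no re-reading
df2 = importer.call(order: %i[date volume])  # another conversion from the same parsed data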

athityakumar (Member Author)

Opened #53 for this. 👍
