Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mtime.value with format string support #2721

Closed
bradenhilton opened this issue Jun 30, 2022 · 3 comments · Fixed by #2739
Closed

mtime.value with format string support #2721

bradenhilton opened this issue Jun 30, 2022 · 3 comments · Fixed by #2739

Comments

@bradenhilton
Copy link
Contributor

I'm currently working on implementing mtime.value with support for format strings and was hoping to gather thoughts and implementation details before I open a PR.

Some use cases:

  • Support for accessing nested metadata (which I believe mtime.key currently cannot do, I also have a workaround for this):
    # weibo example
    mtime.value = "{status[date]:D%Y-%m-%d %H:%M:%S/}"
    
  • Support for parsing datetime from filenames, titles, descriptions, tags e.g.:
    # twitter example
    # if {content} starts with a YYMMDD date such as "220416 ...", but the date
    #   of the post itself is after this date e.g. 2022-05-25
    mtime.value = "{content[0:6]:R22/2022/D%Y%m%d/}" # (prepend century so %Y can be used instead of %y)
    

Some concerns I have:

  • mtime.key and mtime.value logically cannot be used simultaneously. In this case, should an exception be thrown (perhaps a new exception would be required), or should one take precedence over the other?
  • Should the result of the format string be converted to a datetime by default, so that :D<format>/ isn't required? Or should users be expected to ensure the result will already be a valid datetime?

Of course, this will likely become less necessary if/when extractors all provide the same standardized set of metadata keys (filename, url, date, tags etc.).

@mikf
Copy link
Owner

mikf commented Jul 1, 2022

  • mtime.key and mtime.value logically cannot be used simultaneously. In this case, should an exception be thrown (perhaps a new exception would be required), or should one take precedence over the other?

mtime.value should take precedence over mtime.key if present. There are a lot of other places were "competing" options are treated in a similar manner.

  • Should the result of the format string be converted to a datetime by default, so that :D/ isn't required? Or should users be expected to ensure the result will already be a valid datetime?

The result should be expected to be a usable type (datetime, int, float). Makes things easier to implement and it does not involve a potential unnecessary conversion or type check.

I'm guessing you've been trying to use the code from formatter.py, but it currently has one major problem: All values going through the StringFormatter pipeline are implicitly converted to str by calling format() on them.

In your example above, "status[date]" by itself is already a datetime object, but the StringFormatter class automatically converts that to str before returning its value, as would be the case for any datetime created by :D%Y-%m-%d %H:%M:%S/.

@bradenhilton
Copy link
Contributor Author

As I understand it, the current implementation accepts 2 options:

  • A datetime object, which is passed to util.datetime_to_timestamp.
  • A str type whose value is a valid timestamp (int or float), which is passed to text.parse_int.

You've identified that the format string can introduce a new option:

  • A str type whose value is an already parsed datetime.

If further checks and conversions are to be avoided, it looks like we need a way of eliminating this option.

How do you feel about some of these potential solutions?

  • A parse_obj function in the formatter module, which returns objects instead of strings based on the format spec.
  • A new format spec which must be declared last and returns an object instead of a string (not sure how feasible this is).
  • Allowing mtime.value to be a list or dict, where:
    • The first element is the format string without the conversion, e.g. "{content[0:6]:R22/2022/}".
    • The second element is optional and is the conversion string to be passed to text.parse_datetime, e.g. "%Y%m%d".
      • If the second element is not defined, we could assume that the first item will return a timestamp (int or float) and should be covered by the existing text.parse_int.

Of course, you know this codebase better than I ever will, so I'm open to any other ideas you might have.

@mikf
Copy link
Owner

mikf commented Jul 5, 2022

I added a third argument to formatter.parse() (and StringFormatter.__init__()) in 04bed1e, which lets you customize the default "format" function that gets used at the end of the pipeline.

This makes it possible to have it do nothing and return the value as is by using an identity function

formatter.parse("{status[date]}", None, util.identity)
formatter.parse("{content[0:6]:R22/2022/D%Y%m%d/}", None, util.identity)

  • A str type whose value is a valid timestamp (int or float), which is passed to text.parse_int.

It doesn't have to be a str, directly passing int or float works as well.

  • A new format spec which must be declared last and returns an object instead of a string (not sure how feasible this is).

That would have been even easier than my solution, but I noticed it a bit too late, oh well.

A user would specify "status[date]" as mtime.value and the code would use

formatter.parse("{" + mtime.value + ":I")

or something like that, where I is the identity format spec that doesn't do anything and also does not evemtually call format() like all the others do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants