`mtime.value` with format string support #2721

bradenhilton · 2022-06-30T21:28:47Z

I'm currently working on implementing mtime.value with support for format strings and was hoping to gather thoughts and implementation details before I open a PR.

Some use cases:

Support for accessing nested metadata (which I believe mtime.key currently cannot do, I also have a workaround for this):
```
# weibo example
mtime.value = "{status[date]:D%Y-%m-%d %H:%M:%S/}"
```

Support for parsing datetime from filenames, titles, descriptions, tags e.g.:

# twitter example
# if {content} starts with a YYMMDD date such as "220416 ...", but the date
#   of the post itself is after this date e.g. 2022-05-25
mtime.value = "{content[0:6]:R22/2022/D%Y%m%d/}" # (prepend century so %Y can be used instead of %y)

Some concerns I have:

mtime.key and mtime.value logically cannot be used simultaneously. In this case, should an exception be thrown (perhaps a new exception would be required), or should one take precedence over the other?
Should the result of the format string be converted to a datetime by default, so that :D<format>/ isn't required? Or should users be expected to ensure the result will already be a valid datetime?

Of course, this will likely become less necessary if/when extractors all provide the same standardized set of metadata keys (filename, url, date, tags etc.).

The text was updated successfully, but these errors were encountered:

mikf · 2022-07-01T14:57:47Z

mtime.key and mtime.value logically cannot be used simultaneously. In this case, should an exception be thrown (perhaps a new exception would be required), or should one take precedence over the other?

mtime.value should take precedence over mtime.key if present. There are a lot of other places were "competing" options are treated in a similar manner.

Should the result of the format string be converted to a datetime by default, so that :D/ isn't required? Or should users be expected to ensure the result will already be a valid datetime?

The result should be expected to be a usable type (datetime, int, float). Makes things easier to implement and it does not involve a potential unnecessary conversion or type check.

I'm guessing you've been trying to use the code from formatter.py, but it currently has one major problem: All values going through the StringFormatter pipeline are implicitly converted to str by calling format() on them.

In your example above, "status[date]" by itself is already a datetime object, but the StringFormatter class automatically converts that to str before returning its value, as would be the case for any datetime created by :D%Y-%m-%d %H:%M:%S/.

bradenhilton · 2022-07-01T22:10:13Z

As I understand it, the current implementation accepts 2 options:

A datetime object, which is passed to util.datetime_to_timestamp.
A str type whose value is a valid timestamp (int or float), which is passed to text.parse_int.

You've identified that the format string can introduce a new option:

A str type whose value is an already parsed datetime.

If further checks and conversions are to be avoided, it looks like we need a way of eliminating this option.

How do you feel about some of these potential solutions?

A parse_obj function in the formatter module, which returns objects instead of strings based on the format spec.
A new format spec which must be declared last and returns an object instead of a string (not sure how feasible this is).
Allowing mtime.value to be a list or dict, where:
- The first element is the format string without the conversion, e.g. "{content[0:6]:R22/2022/}".
- The second element is optional and is the conversion string to be passed to text.parse_datetime, e.g. "%Y%m%d".
  - If the second element is not defined, we could assume that the first item will return a timestamp (int or float) and should be covered by the existing text.parse_int.

Of course, you know this codebase better than I ever will, so I'm open to any other ideas you might have.

mikf · 2022-07-05T10:45:09Z

I added a third argument to formatter.parse() (and StringFormatter.__init__()) in 04bed1e, which lets you customize the default "format" function that gets used at the end of the pipeline.

This makes it possible to have it do nothing and return the value as is by using an identity function

formatter.parse("{status[date]}", None, util.identity)
formatter.parse("{content[0:6]:R22/2022/D%Y%m%d/}", None, util.identity)

A str type whose value is a valid timestamp (int or float), which is passed to text.parse_int.

It doesn't have to be a str, directly passing int or float works as well.

A new format spec which must be declared last and returns an object instead of a string (not sure how feasible this is).

That would have been even easier than my solution, but I noticed it a bit too late, oh well.

A user would specify "status[date]" as mtime.value and the code would use

formatter.parse("{" + mtime.value + ":I")

or something like that, where I is the identity format spec that doesn't do anything and also does not evemtually call format() like all the others do.

mikf added a commit that referenced this issue Jul 5, 2022

[formatter] allow for custom "format" functions (#2721)

04bed1e

bradenhilton mentioned this issue Jul 8, 2022

[postprocessor:mtime] add 'value' option #2739

Merged

mikf closed this as completed in #2739 Jul 8, 2022

mikf mentioned this issue Jun 26, 2023

Weibo Unknown Error and archive issue #4221

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`mtime.value` with format string support #2721

`mtime.value` with format string support #2721

bradenhilton commented Jun 30, 2022

mikf commented Jul 1, 2022

bradenhilton commented Jul 1, 2022

mikf commented Jul 5, 2022 •

edited

Loading

mtime.value with format string support #2721

mtime.value with format string support #2721

Comments

bradenhilton commented Jun 30, 2022

mikf commented Jul 1, 2022

bradenhilton commented Jul 1, 2022

mikf commented Jul 5, 2022 • edited Loading

`mtime.value` with format string support #2721

`mtime.value` with format string support #2721

mikf commented Jul 5, 2022 •

edited

Loading