Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(date): add datePublished and datemodified support #374

Merged
merged 16 commits into from
Mar 11, 2023

Conversation

madwings
Copy link
Contributor

@madwings madwings commented Feb 25, 2021

With this pull request I propose adding two additional dates to the date package - datePublished and dateModified.
Some websites use both dates for their articles. There are use cases when you want to know both if they exist or exactly these two not just a date found in the html.

Signed-off-by: Stiliyan Ivanov <[email protected]>
Signed-off-by: Stiliyan Ivanov <[email protected]>
Signed-off-by: Stiliyan Ivanov <[email protected]>
@madwings madwings changed the title Add published and modified dates [WIP] Add published and modified dates Feb 26, 2021
@Kikobeats
Copy link
Member

Thanks for this!

My only concern is to add too many date fields as part of the payload.

I understand date is too simple, so what do you think if we make this a configurable thing?

require('metascraper-date')({
  datePublished: true,
  dateModified: true
})

By doing that we can keep the original date behavior, but also make it possible to be more specific in case you want. Also to make this in a progressive way to see the adoption.

I'm thinking into two user cases for that:

  1. if { datePublished: false, dateModified: false }, then const date = dateModified || datePublished
  2. if { datePublished: false } or { dateModified: false }, then date is not returned.

waiting for your opinion 🙂

@madwings
Copy link
Contributor Author

madwings commented Feb 26, 2021

Yeah that's a good idea. Adding these dates as options is good approach and won't change the current behavior.
But I was thinking of a simpler approach. What about this.

  1. if { datePublished: false, dateModified: false }, then date is returned the same way as now. Default behavior. I didn't include all the rules in these two dates. Only these that imply modified or created date. So there will be times, lots of them, when const date = dateModified || datePublished won't return anything. Unless we reorganize the rules of course.

  2. if { datePublished: true} or { dateModified: true} or both true, then just return what was requested alongside date. Yeah it will add two more fields (at most) in the payload but it will be deliberate and requested from the users. Also it is not unique to this package only. Others also return more than one. So they can do something like this: const date = metascraper.datePublished || metascraper.dateModified || metascraper.date;

@Kikobeats
Copy link
Member

Kikobeats commented Feb 26, 2021

LGTM!

I think it's worth it to sort the rules to avoid repetition and make them easy to read.

Doing that, the default behavior could be easy as add the rules as fallbacks

const date = dateRules() || dateModifiedRules() || datePublishedRules()

@madwings
Copy link
Contributor Author

Yep, should be cleaner this way. I will work when have more time and update the pull request.

@Kikobeats Kikobeats force-pushed the master branch 2 times, most recently from 5e7014f to ad9351e Compare January 30, 2022 01:28
@Kikobeats Kikobeats force-pushed the master branch 3 times, most recently from 76ec51a to e71c2b0 Compare May 6, 2022 08:18
@Kikobeats Kikobeats force-pushed the master branch 5 times, most recently from adab710 to 2b7048e Compare May 30, 2022 11:55
@ghmendonca
Copy link

Is this still going to be merged? I need this feature, so if there is still work to be done, let me know so I can do it

@Kikobeats
Copy link
Member

@ghmendonca do you want to lead this PR?

I will be happy to merge if after it's updated with the current codebase

@ghmendonca
Copy link

@Kikobeats yes, leave it to me, will work on this later today

@Kikobeats
Copy link
Member

@ghmendonca feel free to open a new PR picking the changes from here 🙂

madwings added 2 commits March 9, 2023 01:39
Signed-off-by: Stiliyan Ivanov <[email protected]>
Signed-off-by: Stiliyan Ivanov <[email protected]>
@madwings
Copy link
Contributor Author

madwings commented Mar 8, 2023

Hey guys, @Kikobeats @ghmendonca

This was totally forgotten. I have merged the latest changes I made back in the day. The rules are reordered as we agreed on and config options added. This will lead to breaking changes (the date will be different in certain test cases) as the rules are ordered differently than the previous version. If you want 1:1 backward compatibility we should keep the order for the date the same as before. Tests should be adjusted and new ones added, docs updated too. But I think this will help to finish the PR faster.

Probably closest behaviour to the old order will be:

date: dateModifiedRules().concat(datePublishedRules(), dateRules())

madwings added 5 commits March 9, 2023 02:12
Signed-off-by: Stiliyan Ivanov <[email protected]>
Signed-off-by: Stiliyan Ivanov <[email protected]>
Signed-off-by: Stiliyan Ivanov <[email protected]>
Signed-off-by: Stiliyan Ivanov <[email protected]>
Signed-off-by: Stiliyan Ivanov <[email protected]>
@madwings
Copy link
Contributor Author

madwings commented Mar 9, 2023

I think this pretty much wraps up the pull request. The snapshot should be updated as I have problems with my setup to do it.
And probably if you want date rules reorganized.

@madwings madwings changed the title [WIP] Add published and modified dates Add published and modified dates Mar 9, 2023
@Kikobeats
Copy link
Member

@madwings Thanks for taking care; I can finish it 🙂

@Kikobeats
Copy link
Member

@madwings @Heheehd @ghmendonca Do you know any site we can use for creating an integration test using these new rules?

The rule:

```
toDate($ => $filter($, $('[class*="publish" i]')))
```

is taking more priority than:

```
toDate($ => $('[property*="dc:date" i]').attr('content'))
```

It's okay for this case since it's a fuzzy rule so isn't too deterministic
The rule

```
toDate($ => $('[itemprop="datepublished" i]').attr('content'))
```

It's taking more priority than:

```
toDate($ => $('meta[name="date" i]').attr('content')
```

It's okay for this case since it's a fuzzy rule so isn't too deterministic
@madwings
Copy link
Contributor Author

madwings commented Mar 10, 2023

@madwings @Heheehd @ghmendonca Do you know any site we can use for creating an integration test using these new rules?

Some of the domains we are scraping data from (date) too. If you want I can provide you with a lot more or even better - specific articles.
https://www.dropbox.com/s/r27v47jr14kxva7/article_domains.txt?dl=0

@Kikobeats
Copy link
Member

Thanks a lot! Working to add some sites as part of the integration tests 🙂

@ghmendonca
Copy link

@Kikobeats I was testing with this article here -> https://www.mentalfloss.com/article/90967/14-facts-about-wheres-waldo
The published date and modified date are very different

@Kikobeats
Copy link
Member

@ghmendonca that's actually expected since the new rules set is the latest timestamp vs. first timestamp

For example, in a markup like this:

<meta itemprop="datePublished" content="2016-05-24T09:42:00.000Z">
<meta itemprop="dateModified" content="2016-05-24T21:27:05.000Z">

The date field will be dateModified. That is actually okay, but I want to double-check manually all is under this expectation.

@Kikobeats
Copy link
Member

Another pretty clear example:

CleanShot 2023-03-10 at 16 56 28@2x

The new date field is the latest modified timestamp rather than the publication date

@coveralls
Copy link
Collaborator

Coverage Status

Coverage: 98.068% (-0.2%) from 98.235% when pulling af1691e on madwings:feature/add-date into ff0e5d8 on microlinkhq:master.

@Kikobeats Kikobeats changed the title Add published and modified dates feat(date): add datePublished and datemodified support Mar 11, 2023
@Kikobeats Kikobeats merged commit d219b58 into microlinkhq:master Mar 11, 2023
@Kikobeats
Copy link
Member

shipped!

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants