Structured Data 2021 #2174

Closed
6 tasks done
rviscomi opened this issue Apr 27, 2021 · 42 comments · Fixed by #2466

@rviscomi
Member

rviscomi commented Apr 27, 2021

Part I Chapter 4: Structured Data

If you're interested in contributing to the Structured Data chapter of the 2021 Web Almanac, please reply to this issue and indicate which role or roles best fit your interest and availability: author, reviewer, analyst, and/or editor.

Content team

Lead: @jonoalderson
Authors: @jonoalderson, @cyberandy
Reviewers: @kevinmarks, @vdwijngaert, @jvandriel, @philbarker
Analysts: @GregBrimble
Editors: @jvandriel, @JasmineDWillson
Coordinator: @rviscomi
More information about each role:
  • The content team lead is the chapter owner and responsible for setting the scope of the chapter and managing contributors' day-to-day progress.
  • Authors are subject matter experts and lead the content direction for each chapter. Chapters typically have one or two authors. Authors are responsible for planning the outline of the chapter, analyzing stats and trends, and writing the annual report.
  • Reviewers are also subject matter experts and assist authors with technical reviews during the planning, analyzing, and writing phases.
  • Analysts are responsible for researching the stats and trends used throughout the Almanac. Analysts work closely with authors and reviewers during the planning phase to give direction on the types of stats that are possible from the dataset, and during the analyzing/writing phases to ensure that the stats are used correctly.
  • Editors are technical writers who have a penchant for both technical and non-technical content correctness. Editors have a mastery of the English language and work closely with authors to help wordsmith content and ensure that everything fits together as a cohesive unit.
  • The section coordinator is the overall owner for all chapters within a section like "User Experience" or "Page Content" and helps to keep each chapter on schedule.

Note: The time commitment for each role varies by the chapter's scope and complexity as well as the number of contributors.

For an overview of how the roles work together at each phase of the project, see the Chapter Lifecycle doc.

Milestone checklist

0. Form the content team

  • May 31: The content team has at least one author, reviewer, and analyst

1. Plan content

  • June 15: The content team has completed the chapter outline in the draft doc

2. Gather data

  • June 30: Analysts have added all necessary custom metrics and drafted a PR (example) to track query progress
  • July 1 - 31: HTTP Archive runs the July crawl

3. Validate results

  • September 30: Analysts have queried all metrics and saved the output to the results sheet

4. Draft content

  • October 31: The content team has written, reviewed, and edited the chapter in the doc

5. Publication

  • November 15: The completed chapter and all required metadata and figures are converted to markdown and submitted to GitHub
  • December 1: Target launch date 🚀

Chapter resources

Refer to these 2021 Structured Data resources throughout the content creation process:

📄 Google Docs for outlining and drafting content
🔍 SQL files for committing the queries used during analysis
📊 Google Sheets for saving the results of queries
📝 Markdown file for publishing content and managing public metadata

@rviscomi rviscomi added 2021 chapter Tracking issue for a 2021 chapter help wanted Extra attention is needed labels Apr 27, 2021
@jonoalderson
Contributor

jonoalderson commented Apr 27, 2021

As discussed in Slack, I'd be very keen to author this. I'd also be happy to take my hat out of the 'Author' ring (and to play Reviewer instead) for #2148 so as to be able to resource this effectively (which I've updated accordingly).

@rviscomi
Member Author

rviscomi commented May 4, 2021

@jono-alderson thanks for your interest in authoring this chapter! As the content team lead, you'll be responsible for the scope and direction of the chapter and keeping it on schedule. We automatically monitor the staffing and progress of each chapter based on the state of the initial comment so please keep that updated as you add new contributors and meet each milestone.

We've created a Google Doc for this chapter, which you're encouraged to use to collaborate with the content team on the initial outline, metrics, and ultimately the final draft.

Next steps for this chapter are:

There's not currently a section coordinator for this chapter, so I'll be periodically checking in with you directly to make sure the chapter is staying on schedule. Reach out here in this issue if you have any questions about the process.

More information about the content team lead and author roles and responsibilities is available for reference in the wiki if needed.

To anyone else interested in contributing to this chapter, please comment below to join the team!

@rviscomi rviscomi added help wanted: analysts This chapter is looking for data analysts help wanted: reviewers This chapter is looking for reviewers labels May 4, 2021
@GregBrimble
Member

Hey @jono-alderson ,

If you'll have me, I'd love to help out with the analysis for this chapter, this year!

@jonoalderson
Contributor

jonoalderson commented May 6, 2021

Hey @jono-alderson ,
If you'll have me, I'd love to help out with the analysis for this chapter, this year!

That'd be wonderful, thanks! NB, I'm aiming to start outlining a plan and firing out some comms this weekend :)

@rviscomi
Member Author

rviscomi commented May 11, 2021

Hi @jono-alderson just checking in. Here are some tips to help keep the chapter on track:

  • Request edit access to the doc and start brainstorming an outline for the chapter
  • Consider announcing to your professional networks that you're looking for co-contributors knowledgeable in structured data to join the chapter
  • Edit the top comment to keep the chapter metadata in sync with all reviewers and analysts and also any completed milestones (helpful for us to monitor progress at a glance in 2021 Chapter Progress #2179)

⚠️ Note that if we're unable to meet Milestone 0 by May 31 we may have to close this chapter and refocus our efforts on other chapters.

@rviscomi rviscomi mentioned this issue May 11, 2021
6 tasks
@vdwijngaert
Contributor

Happy to help if you guys need any more reviewers :)

@kevinmarks

You asked about microformats - I'm happy to help review on that area, and help those running analyses make sense of them.

@jonoalderson
Contributor

Happy to help if you guys need any more reviewers :)

Thanks - more reviewers are definitely welcome! I have a feeling that we're going to need lots of hands on deck for this!

@jonoalderson
Contributor

jonoalderson commented May 11, 2021

You asked about microformats - I'm happy to help review on that area, and help those running analyses make sense of them.

Thanks, Kevin, that'd be amazing. I'm conscious that whilst schema.org and JSON-LD are very trendy at the moment, there's a lot of structured data out there in legacy formats that I'm keen for us not to overlook. I'll add you as a reviewer! Delightful to have your input.

@jonoalderson
Contributor

@rviscomi I don't appear to be able to edit the top comment; do I need some permissions?

@GregBrimble
Member

GregBrimble commented May 11, 2021

I've apparently got edit access, so I've added @kevinmarks and @vdwijngaert as reviewers, and myself as an analyst, @jono-alderson :)

I've also checked off that May 31st milestone since we now have at least one of each role. Do you want to remove the help wanted badges, or are you still looking for more people to help out?

@jonoalderson
Contributor

jonoalderson commented May 11, 2021

Thanks! Still happy to invite more folks. It's a big topic, so I'm happy to cat-herd involvement from a wider pool potentially; unless there are good reasons not to?

Could you also add @jvandriel as a reviewer and editor, please? :)

@GregBrimble
Member

Nope, I'm sure that's fine to leave the badges up if we're still looking for people :)

Added, and also put everyone in the frontmatter of the Google doc as well.

@cyberandy
Contributor

Hi all 👋 happy to contribute on this one - either as author or editor, whatever feels more necessary.

@jvandriel

I'm happy to join and help out as well - also very curious to see the outcome

@rviscomi
Member Author

rviscomi commented May 11, 2021

@rviscomi I don't appear to be able to edit the top comment; do I need some permissions?

@jono-alderson you'll need to accept our invitation to join the HTTP Archive team in order to get edit access on GitHub. Check your email or visit https://github.com/HTTPArchive/ to accept.

Happy to see the increased interest in this chapter!

@rviscomi rviscomi removed help wanted Extra attention is needed help wanted: analysts This chapter is looking for data analysts help wanted: reviewers This chapter is looking for reviewers labels May 11, 2021
@rviscomi
Member Author

Here's the sharable link for anyone to join the Slack channel: https://join.slack.com/t/httparchive/shared_invite/zt-45sgwmnb-eDEatOhqssqNAKxxOSLAaA

@jonoalderson
Contributor

Looks like we have everybody in Slack except for @vdwijngaert; are you able to join us, Koen? :)

@jonoalderson jonoalderson added the help wanted: analysts This chapter is looking for data analysts label May 16, 2021
@philbarker

@jono-alderson I'm here because @jvandriel asked, then I saw your tweet asking for involvement from people with expertise in Dublin Core / other metadata. I might be able to help as reviewer, if you still need such help.

@jonoalderson
Copy link
Contributor

Hi @philbarker, thanks for reaching out! That'd be amazing; I'll add you to the team list! I know I'm personally weak on knowledge around DC, so keen to have an expert involved!

Please feel free to jump into the Slack channel, and contribute any ideas/direction, etc!

@rviscomi rviscomi removed the help wanted: analysts This chapter is looking for data analysts label May 17, 2021
@rviscomi
Member Author

All, the outline in the chapter doc is looking great. Nice work! 🚀

@jono-alderson is the outline complete, or are you still adding to it?

@jonoalderson
Contributor

Getting there!
Hoping for a bit more feedback from the crew, as I feel there's more we could do without being too over-ambitious. Any ideas, folks?

@GregBrimble
Member

One thing I might suggest would be a deeper integration with knowledge graphs like Wikidata. If I've got this structured data on a page:

{
  "@type": "Person",
  "name": "Greg Brimble",
  "nationality": {
    "@type": "Country",
    "name": "United Kingdom"
  },
  "sameAs": ["https://www.wikidata.org/wiki/Q52444075"]
}

and Wikidata has this:

"instance of" → "human"
(P31 → Q5)

"country of citizenship" → "United Kingdom"
(P27 → Q145)
  • How much overlap is there? Does the structured data provide information not found in Wikidata, or vice-versa?
  • Are there any inconsistent claims?
  • Do the types of the entity match?
  • etc.

This is getting dangerously close to what my undergraduate dissertation was on 😅 The difficulty is in doing the ontology matching (finding equivalent properties and entities), which might be a bit out-of-scope for this analysis (e.g. Schema.org's "Person" ≠ Q5, but Schema.org's "nationality" === P27).
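To make the comparison above concrete, here is a minimal Python sketch, not something the chapter's analysis actually ran. It assumes a tiny hand-written alignment between Schema.org terms and Wikidata identifiers (`PROPERTY_MAP`, `TYPE_MAP`, and `LABEL_MAP` are all hypothetical), and it deliberately treats "Person" as loosely corresponding to Q5 even though the two are not strictly equivalent. The Wikidata claims are hard-coded from the example rather than fetched.

```python
# Hypothetical alignment tables. These pairs are assumptions for illustration,
# not an established Schema.org <-> Wikidata mapping.
PROPERTY_MAP = {"nationality": "P27"}    # Schema.org property -> Wikidata property
TYPE_MAP = {"Person": "Q5"}              # loose match only; not strictly equivalent
LABEL_MAP = {"United Kingdom": "Q145"}   # entity label -> Wikidata item


def to_claims(node):
    """Translate a JSON-LD node into {wikidata_property: wikidata_item} pairs."""
    claims = {}
    type_item = TYPE_MAP.get(node.get("@type"))
    if type_item:
        claims["P31"] = type_item  # P31 is "instance of"
    for prop, value in node.items():
        pid = PROPERTY_MAP.get(prop)
        if pid is None:
            continue  # no known equivalent property
        name = value.get("name") if isinstance(value, dict) else value
        item = LABEL_MAP.get(name)
        if item:
            claims[pid] = item
    return claims


def compare(page_claims, wikidata_claims):
    """Report where the two sources agree, conflict, or have unique claims."""
    shared = set(page_claims) & set(wikidata_claims)
    return {
        "agree": sorted(p for p in shared if page_claims[p] == wikidata_claims[p]),
        "conflict": sorted(p for p in shared if page_claims[p] != wikidata_claims[p]),
        "only_on_page": sorted(set(page_claims) - shared),
        "only_in_wikidata": sorted(set(wikidata_claims) - shared),
    }


page = {
    "@type": "Person",
    "name": "Greg Brimble",
    "nationality": {"@type": "Country", "name": "United Kingdom"},
    "sameAs": ["https://www.wikidata.org/wiki/Q52444075"],
}
wikidata = {"P31": "Q5", "P27": "Q145"}  # copied from the example above

report = compare(to_claims(page), wikidata)
print(report)
```

The hard part remains building the alignment tables at scale; the comparison itself is simple once they exist.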

@jonoalderson
Contributor

That'd be pretty awesome, but I think that comparing to external sources at scale is going to be waaayyy out of scope.
We should definitely put more attention on sameAs declarations, though; there are bound to be some interesting findings in detecting common hostnames and patterns in there.
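As a rough illustration of the sameAs idea, the sketch below tallies the hostnames that sameAs values point to. It is only an assumption about how such a count could work, not the query used for the chapter; the `same_as_hostnames` helper and the sample nodes (other than the Wikidata URL quoted earlier in this thread) are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def same_as_hostnames(nodes):
    """Count the hostnames referenced by sameAs across JSON-LD nodes."""
    counts = Counter()
    for node in nodes:
        values = node.get("sameAs", [])
        if isinstance(values, str):
            values = [values]  # sameAs may be a single URL or a list
        for url in values:
            if not isinstance(url, str):
                continue  # skip nested objects and other non-URL values
            host = urlparse(url).hostname
            if not host:
                continue  # relative or malformed URL
            if host.startswith("www."):
                host = host[4:]  # fold www.example.com into example.com
            counts[host] += 1
    return counts


# Hypothetical sample data for illustration only.
nodes = [
    {"@type": "Person", "sameAs": ["https://www.wikidata.org/wiki/Q52444075"]},
    {"@type": "Organization", "sameAs": "https://wikidata.org/wiki/Q145"},
    {"@type": "Person", "sameAs": ["https://social.example.com/greg", "not a url"]},
]

counts = same_as_hostnames(nodes)
print(counts.most_common())
```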

@JasmineDWillson

JasmineDWillson commented May 28, 2021

Might be interesting to touch on the use of sameAs:

  1. In terms of its limitations
  2. Or ways that others have tried to navigate mapping to terms that are not entirely equivalent
     • e.g. the CWRC ontology's hasFunctionalRelation predicate which "Relates...to external terms that are semantically incommensurate but that may be pragmatically related for processing purposes such as search and retrieval"

without diving too deeply into the mire of ontology mapping...

@rviscomi
Member Author

Hey @jono-alderson, could you give an update on the chapter outline? I see some new topics added today, but not sure if it's still being worked on. If it's finalized you could check off Milestone 1 above, otherwise let us know when you think it'll be ready. Thanks!

@GregBrimble please take a close look at the outline to see whether we need any custom metrics to extract structured data info from the DOM at runtime. Those would need to be written and merged no later than the end of the month to be added to the test pipeline in time.

@jonoalderson
Contributor

Hello hello! I'm happy with the chapter outline, and will check off the milestone now.

@GregBrimble, I think we need to explore your message in Slack (https://httparchive.slack.com/archives/C021GGN9W4D/p1623610269059000) ASAP, as that might influence our next steps.

@GregBrimble
Member

And we've got the run's results! July's data is up so we can now play around in BigQuery.

I've started the queries in #2293, and have requested edit access to the results sheet so I can start putting stuff down there. Checked the error log as a first priority, and so far, it looks pretty good. We have our structured data custom metrics on 13,775,158 of the 13,778,213 pages we've run against. We captured 508 error logs, and I'm assuming the rest (2,547) failed so hard that we couldn't even capture the exception. 99.98% success is good enough for me.
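For anyone checking the numbers, the arithmetic behind those figures works out as follows. This is a plain recomputation from the counts quoted above; the variable names are only for illustration.

```python
total_pages = 13_778_213         # pages the crawl ran against
pages_with_metrics = 13_775_158  # pages that returned structured data custom metrics
logged_errors = 508              # failures that produced an error log

missing = total_pages - pages_with_metrics        # pages without metrics
unlogged_failures = missing - logged_errors       # failures with no captured exception
success_rate = pages_with_metrics / total_pages * 100

print(missing, unlogged_failures, round(success_rate, 2))
```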

This analysis is due September 30, but I can't imagine it takes nearly that long. The hardest bit will be the JSON-LD parsing, which in all likelihood I'm going to do locally. I'll do bits and pieces over the next few days, so keep an eye on that linked PR to follow along :)
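A local JSON-LD pass could look something like the sketch below. It is a minimal, standard-library-only illustration and not necessarily how the analysis was done: it assumes the raw HTML is available, pulls out `application/ld+json` script blocks, counts the ones that fail to parse, and collects every `@type` it finds. The `JsonLdExtractor` class, the `collect_types` helper, and the sample page are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect parsed JSON-LD blocks from <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []  # successfully parsed JSON-LD documents
        self.errors = 0   # blocks that were not valid JSON

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                self.errors += 1


def collect_types(value):
    """Recursively gather every @type string in a JSON-LD value."""
    types = []
    if isinstance(value, dict):
        declared = value.get("@type")
        if isinstance(declared, str):
            types.append(declared)
        elif isinstance(declared, list):
            types.extend(t for t in declared if isinstance(t, str))
        for child in value.values():
            types.extend(collect_types(child))
    elif isinstance(value, list):
        for item in value:
            types.extend(collect_types(item))
    return types


# Hypothetical page: one valid block and one broken block.
html = """<html><head>
<script type="application/ld+json">{"@type": "Person", "name": "Greg Brimble",
 "nationality": {"@type": "Country", "name": "United Kingdom"}}</script>
<script type="application/ld+json">{not valid json}</script>
</head></html>"""

parser = JsonLdExtractor()
parser.feed(html)
types = [t for block in parser.blocks for t in collect_types(block)]
print(types, parser.errors)
```

This only reads the JSON syntax; it does not expand contexts or resolve `@id` references the way a full JSON-LD processor would.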

@jonoalderson
Contributor

This is monumentally exciting!
I'll make a start on some of the 'generic' content/write-up (fleshed out introductions, etc) in the meantime!

@rviscomi
Member Author

👋 Hi @jono-alderson @cyberandy @GregBrimble, just checking in on the chapter progress. How is the analysis coming along?

@GregBrimble
Member

Hey, made decent progress the weekend before last, but it's a busy week at work, this week, so haven't had a chance to get back to it. I'll get this completed next weekend ☺️

@jonoalderson
Contributor

Any updates from your side, @GregBrimble ?

@rviscomi
Member Author

rviscomi commented Nov 29, 2021

@jonoalderson @cyberandy @kevinmarks @vdwijngaert @jvandriel @philbarker @GregBrimble @jvandriel @JasmineDWillson

Thank you all for your hard work getting this chapter over the finish line in time for the pre-release. Structured Data has been the most-read (English-version) chapter in the past couple of weeks! Congratulations on finishing the chapter, and I'm excited to see us launch the rest of the chapters alongside it on Wednesday 🎉

When you get 5 minutes, I'd really appreciate it if you could fill out our contributor survey to tell us (the project leads) about your experience. It's super helpful to hear what went well or what could be improved for next time. 🙏


9 participants