Crawler transform #797

Merged
merged 26 commits into from
Nov 16, 2024

Commits on Nov 8, 2024

  1. first implementation of web2parquet for crawling/downloading from seedURLs
    
    Signed-off-by: Maroun Touma <[email protected]>
    touma-I committed Nov 8, 2024
    41bed68

Commits on Nov 11, 2024

  1. use makefile template

    Signed-off-by: Maroun Touma <[email protected]>
    touma-I committed Nov 11, 2024
    cf516b5

Commits on Nov 13, 2024

  1. complete full implementation and testing with python runtime

    Signed-off-by: Maroun Touma <[email protected]>
    touma-I committed Nov 13, 2024
    acc35cd
  2. identified current requirements for web2parquet module

    Signed-off-by: Maroun Touma <[email protected]>
    touma-I committed Nov 13, 2024
    3e05f30
  3. relaxed dependencies

    Signed-off-by: Maroun Touma <[email protected]>
    touma-I committed Nov 13, 2024
    5710653
  4. added build target

    Signed-off-by: Maroun Touma <[email protected]>
    touma-I committed Nov 13, 2024
    80e4ebe
  5. cf20268

Commits on Nov 14, 2024

  1. added licence block

    Signed-off-by: Maroun Touma <[email protected]>
    touma-I committed Nov 14, 2024
    4dcebb6
  2. 137d92c
  3. fix filename issue

    Signed-off-by: Maroun Touma <[email protected]>
    touma-I committed Nov 14, 2024
    d2404f4
  4. generate cicd workflow for new transform

    Signed-off-by: Maroun Touma <[email protected]>
    touma-I committed Nov 14, 2024
    1e810d0
  5. build image only if a Dockerfile is defined

    Signed-off-by: Maroun Touma <[email protected]>
    touma-I committed Nov 14, 2024
    fcbcc0a
  6. Ignore page content as long as we get the right count

    Signed-off-by: Maroun Touma <[email protected]>
    touma-I committed Nov 14, 2024
    b5031c9

Commits on Nov 15, 2024

  1. rename make.cicd.target

    Signed-off-by: Maroun Touma <[email protected]>
    touma-I committed Nov 15, 2024
    9ad3d18
  2. updated notebook with example

    Signed-off-by: Maroun Touma <[email protected]>
    touma-I committed Nov 15, 2024
    c9c9779
  3. updated notebook with example

    Signed-off-by: Maroun Touma <[email protected]>
    touma-I committed Nov 15, 2024
    b77bbe9
  4. added readme.md

    Signed-off-by: Maroun Touma <[email protected]>
    touma-I committed Nov 15, 2024
    8e71177
  5. fix typos

    Signed-off-by: Maroun Touma <[email protected]>
    touma-I committed Nov 15, 2024
    ef7c57d
  6. More typos

    Signed-off-by: Maroun Touma <[email protected]>
    touma-I committed Nov 15, 2024
    8c55ad8
  7. more typos

    Signed-off-by: Maroun Touma <[email protected]>
    touma-I committed Nov 15, 2024
    ba4b0a4
  8. more typos

    Signed-off-by: Maroun Touma <[email protected]>
    touma-I committed Nov 15, 2024
    6ea2e76
  9. reference nested asyncio project

    Signed-off-by: Maroun Touma <[email protected]>
    touma-I committed Nov 15, 2024
    670f381
  10. fix typo

    Signed-off-by: Maroun Touma <[email protected]>
    touma-I committed Nov 15, 2024
    46b168a
  11. added instructions for installing the webcrawler module

    Signed-off-by: Maroun Touma <[email protected]>
    touma-I committed Nov 15, 2024
    190969b
  12. added the module to the transform package

    Signed-off-by: Maroun Touma <[email protected]>
    touma-I committed Nov 15, 2024
    96e46c7
  13. added requirements for web2parquet

    Signed-off-by: Maroun Touma <[email protected]>
    touma-I committed Nov 15, 2024
    4a59970