@cejiogu cejiogu commented Oct 15, 2025

Overview

In this commit, I:

  • Refactor how the application scrapes information about campus printers from the web
  • Standardize all scraped information
  • Introduce labels as a new field of information for each printer

Changes Made

Change 1: Printer information is retrieved from the API request that populates the page
Previously, the application used BeautifulSoup to parse the HTML of the webpage that displays details about each printer on campus. With this change, the application moves from HTML scraping to API scraping: instead of reading from the rendered webpage, it now calls the API request that populates the HTML table and retrieves details about each printer from the JSON object returned in the response. This makes our information retrieval cleaner, safer, and more efficient.
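As a sketch of what this retrieval looks like, the snippet below normalizes a JSON payload into a list of printer dicts. The endpoint URL and JSON field names here are assumptions for illustration; the real ones live in src/data/scrapers/printers.py and may differ.

```python
import json

# Hypothetical endpoint; the real URL used by scrape_printers may differ.
PRINTERS_API_URL = "https://example.cornell.edu/api/printers"  # assumption

def parse_printer_payload(payload: str) -> list[dict]:
    """Normalize the raw JSON API response into a list of printer dicts."""
    printers = []
    for entry in json.loads(payload):
        printers.append({
            "name": entry.get("name", "").strip(),
            "location": entry.get("location", "").strip(),
            "capabilities": entry.get("capabilities", "").strip(),
        })
    return printers

# In production the payload would come from the API call, e.g.:
#   payload = requests.get(PRINTERS_API_URL, timeout=10).text
sample = '[{"name": "olin-bw", "location": "Olin Library ", "capabilities": "Black & White"}]'
print(parse_printer_payload(sample))
```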

Change 2: Printer location information is standardized
To account for minor "mutations" in the scraped data, I also used the difflib Python library to map scraped building names to a canonical list of building names, ensuring that all scraped locations are standardized. This ensures that if "Baker Lab CLOSED FOR CONSTRUCTION," for example, is scraped from the website, we still only see "Baker Lab" on the actual application.
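A minimal sketch of this mapping with difflib; the canonical list and similarity cutoff below are illustrative stand-ins, not the values used in the project:

```python
import difflib

# Abbreviated, illustrative canonical list; the real one is longer.
CANONICAL_BUILDINGS = ["Baker Lab", "Olin Library", "Mann Library", "Upson Hall"]

def canonicalize_building(scraped: str) -> str:
    """Map a scraped building name to its canonical form, if one is close enough."""
    # A low cutoff tolerates trailing noise like "CLOSED FOR CONSTRUCTION";
    # 0.4 is an illustrative value, not the project's.
    matches = difflib.get_close_matches(scraped, CANONICAL_BUILDINGS, n=1, cutoff=0.4)
    return matches[0] if matches else scraped

print(canonicalize_building("Baker Lab CLOSED FOR CONSTRUCTION"))  # prints "Baker Lab"
```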

Change 3: Introduce labels as another field of information for each printer
To implement labels for each printer, I included "Labels" as another field in each object in the list returned from the scrape_printers function. To populate this field, I created a canonical list of labels (which currently accounts only for "Residents Only," "AA&P Students Only," and "Landscape Architecture Students Only," and is unlikely to be exhaustive) and then used the difflib Python library to recognize any canonical labels in the parsed data. I also include printer capabilities as labels, meaning that "Color," "Black & White," and "Color, Scan, & Copy" are labels as well.
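The label extraction could be sketched as follows. The canonical labels come from this PR's description; the cutoff, function name, and input fields are illustrative assumptions:

```python
import difflib

# The canonical access labels named in this PR; unlikely to be exhaustive.
CANONICAL_LABELS = [
    "Residents Only",
    "AA&P Students Only",
    "Landscape Architecture Students Only",
]
CAPABILITY_LABELS = ["Color", "Black & White", "Color, Scan, & Copy"]

def extract_labels(description: str, capability: str) -> list[str]:
    """Collect access labels fuzzily present in free text, plus the capability."""
    labels = [
        label
        for label in CANONICAL_LABELS
        # cutoff=0.5 is an illustrative threshold, not the project's value.
        if difflib.get_close_matches(label, [description], n=1, cutoff=0.5)
    ]
    # Capabilities are treated as labels too, as described above.
    if capability in CAPABILITY_LABELS:
        labels.append(capability)
    return labels

print(extract_labels("Risley Hall - Residents Only", "Color"))
```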

For this information to be stored in our SQLite database, I also introduced two new tables: a "labels" table — which stores the unique labels a given printer may have — and a "printer_labels" table — which is a junction table, mapping each printer to its corresponding labels.
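A minimal sketch of the two new tables and the junction-table lookup, using an in-memory SQLite database; the actual column names and types in the project's schema may differ:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE printers (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE labels (               -- each unique label stored once
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE printer_labels (       -- junction table: printer <-> label
    printer_id INTEGER NOT NULL REFERENCES printers(id),
    label_id   INTEGER NOT NULL REFERENCES labels(id),
    PRIMARY KEY (printer_id, label_id)
);
""")
conn.execute("INSERT INTO printers VALUES (1, 'olin-bw')")
conn.executemany("INSERT INTO labels VALUES (?, ?)",
                 [(1, "Black & White"), (2, "Residents Only")])
conn.executemany("INSERT INTO printer_labels VALUES (?, ?)", [(1, 1), (1, 2)])

# Fetch all labels for printer 1 through the junction table.
rows = conn.execute(
    "SELECT l.name FROM labels l "
    "JOIN printer_labels pl ON pl.label_id = l.id "
    "WHERE pl.printer_id = ? ORDER BY l.id", (1,)
).fetchall()
print([name for (name,) in rows])  # prints ['Black & White', 'Residents Only']
```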

Finally, for this information to be retrieved via our API, I updated the fetchAllPrinters function found in EcosystemUtils.js to also return a list of labels for each returned printer.

Test Coverage

To test the refactored web scraping, the label parsing and mapping, and the location parsing and mapping, I ran src/data/scrapers/printers.py as a module to verify that each location was mapped to a name in the canon and that the correct labels were assigned to each printer. To do this, I pasted the following code at the bottom of the file.

if __name__ == "__main__":
    results = scrape_printers()
    print(f"Scraped {len(results)} printers.\n")

    # Print a sample of the data
    for row in results:
        print(row)

Then, I changed my working directory to the scrapers folder and ran the file with the following command.

python3 printers.py

@cejiogu cejiogu marked this pull request as ready for review October 15, 2025 15:03