Adding the libre textbooks #149

Open: wants to merge 23 commits into base: main

Conversation

@hssn-20 hssn-20 commented Apr 2, 2023

This script imports the uploaded libre chemistry textbooks dataset from Hugging Face, cleans the data by removing hyperlinks, licenses, and chapter headers, and then removes specific lines that were selected manually. The cleaned data is then saved, and a metadata YAML file is generated from a template. Here's a Colab notebook which implements the process.
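
For readers who want a feel for the cleaning steps described above, here is a minimal sketch. It is not the script from this PR: the regular expressions, the function name `clean_page`, and the idea of passing the manually selected lines as a set are all assumptions made for illustration.

```python
import re

# Hypothetical patterns; the actual rules used in the PR are not shown in this thread.
HYPERLINK_RE = re.compile(r"https?://\S+")
LICENSE_RE = re.compile(r"licen[sc]e|CC BY", re.IGNORECASE)
CHAPTER_HEADER_RE = re.compile(r"^\s*(chapter\s+\d+|\d+\.\d+:)", re.IGNORECASE)
EXTRA_SPACES_RE = re.compile(r"\s{2,}")


def clean_page(text, lines_to_remove=frozenset()):
    """Return `text` without license lines, chapter headers, hyperlinks,
    and any line listed in `lines_to_remove` (the manual selection)."""
    cleaned = []
    for line in text.splitlines():
        # Drop lines that were flagged by hand.
        if line.strip() in lines_to_remove:
            continue
        # Drop whole lines that look like license notices or chapter headers.
        if LICENSE_RE.search(line) or CHAPTER_HEADER_RE.match(line):
            continue
        # Strip hyperlinks but keep the rest of the line.
        line = HYPERLINK_RE.sub("", line)
        line = EXTRA_SPACES_RE.sub(" ", line).strip()
        if line:
            cleaned.append(line)
    return "\n".join(cleaned)
```

Calling `clean_page(page_text, {"Summary"})` would then return the page with the flagged line, the license notice, and the header removed. Real pages will likely need more careful patterns than these.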

@hssn-20 hssn-20 changed the title Draft: Adding the libre textbooks Adding the libre textbooks Apr 13, 2023

hssn-20 commented Apr 13, 2023

Hopefully this PR is OK for the first version of this dataset. In the next version, I'd like to remove exercises along with their solutions from the dataset and encode chemicals in a consistent format.

hssn-20 commented Apr 13, 2023

pre-commit.ci autofix

MicPie commented Apr 17, 2023

Hey @hssn-20, thank you very much for the PR! 🙏
I just had a look and triggered the pre-commit checks on GitHub; see the results here: https://results.pre-commit.ci/run/github/601226793/1681519715.6rdNlKF6QWaniPzvuAnS1g (the link is also at the end below).
It's best to merge the latest main again, make sure the latest pre-commit hooks are installed properly with pre-commit install, and then run black . (both in the main directory) to auto-format the code.
Then you can rerun the YAML creation with python transform.py and add those changes in a new commit to the PR.
Just let me know if you can add those changes; if not, I can also have a look. 😃

import yaml


LINES_TO_REMOVE = "/workspaces/chemnlp/data/libre_textbooks/lines_to_remove.jsonl"

This is not used below. Are those lines already removed on the HF dataset upload?
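
If the constant is meant to be used rather than dropped, the filtering step might look something like the sketch below. The JSONL schema is not shown in this thread, so the `"line"` field and both function names are hypothetical.

```python
import json
from pathlib import Path


def load_lines_to_remove(path):
    """Collect the (assumed) "line" field from every record of a JSONL file."""
    lines = set()
    with Path(path).open(encoding="utf-8") as handle:
        for raw in handle:
            raw = raw.strip()
            if raw:  # skip blank lines between records
                lines.add(json.loads(raw)["line"])
    return lines


def drop_lines(text, lines_to_remove):
    """Return `text` without any line that exactly matches a flagged line."""
    kept = [line for line in text.splitlines() if line not in lines_to_remove]
    return "\n".join(kept)
```

With this in place, `drop_lines(page_text, load_lines_to_remove(LINES_TO_REMOVE))` would apply the manual selection at transform time. A path relative to the script would also make the constant usable outside one particular workspace.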

"identifiers": [
{
"id": "url ", # column name
"type": "OTHER", # can be "SMILES", "SELFIES", "IUPAC", "OTHER"

Did the commit hooks run through with "OTHER" (capital letters)?
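
To make the concern concrete: if the check is a case-sensitive membership test, only the exact spellings listed in the code comment would pass. The sketch below assumes exactly that; the allowed values are taken from the comment in the snippet above, while the function name and the check itself are hypothetical and may not match what the repository's hooks actually do.

```python
# Allowed values copied from the code comment; the validation logic is an assumption.
ALLOWED_IDENTIFIER_TYPES = {"SMILES", "SELFIES", "IUPAC", "OTHER"}


def check_identifier(identifier):
    """Return the identifier type if it is allowed, otherwise raise ValueError."""
    value = identifier["type"]
    if value not in ALLOWED_IDENTIFIER_TYPES:
        raise ValueError(
            f"unknown identifier type {value!r}; "
            f"expected one of {sorted(ALLOWED_IDENTIFIER_TYPES)}"
        )
    return value
```

Under this assumption `{"id": "url", "type": "OTHER"}` passes, while `"Other"` or `"other"` would be rejected.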

"id": "html", # name of the column in a tabular dataset
"description": "A scraped page from libre textbooks",
"units": None, # units of the values in this column (leave empty if unitless)
"type": "string", # can be "categorical", "ordinal", "continuous", "string"

@kjappelbaum kjappelbaum May 5, 2023

Suggested change
- "type": "string", # can be "categorical", "ordinal", "continuous", "string"
+ "type": "text", # can be "categorical", "ordinal", "continuous", "text"

Comment on lines +17 to +19
- id: text_length
  type: int
  description: text character count

Suggested change
- - id: text_length
-   type: int
-   description: text character count

@kjappelbaum kjappelbaum requested a review from MicPie May 5, 2023 11:34
@kjappelbaum
Collaborator

@MicPie requires that we add the text type also used for #188
