This repo helps us track changes and preprocessing methods. We refer to this dataset as the base dataset because it applies common, basic preprocessing steps whose output can then be processed further in our independent use cases. This repo also enforces data version control with DVC and allows easy retrieval of the base dataset as long as you have access to the remote server instance-janellah.

This gives us dataset standardization and a single source of truth across our use cases, and makes it easy to reproduce the pipeline with other datasets in the future.
.
├── data/ # Stores data for preprocessing
│ ├── base/
│ │ ├── consolidated_profiles.csv # Profiles dataset
│ │ ├── consolidated_products.csv # Reviews dataset
│ │ └── consolidated_product_info.csv # Products dataset
│ │
│ ├── raw/
│ │ ├── consolidated_profiles.csv # Profiles dataset
│ │ ├── consolidated_products.csv # Reviews dataset
│ │ └── consolidated_product_info.csv # Products dataset
│ │
│ └── preprocessing/
│ ├── contractions.txt
│ └── slangs.txt
│
├── configs/ # Stores configuration files for pipeline
│ └── config.json
│
├── installation/ # Required installation files
│ ├── environment.yml
│ └── install.sh
│
├── data_loader/ # DataLoader class to load data from files
│ └── data_loader.py
│
├── preprocess/ # Preprocessor class for preprocessing and cleaning steps
│ └── preprocessor.py
│
├── pipeline/ # Generator class to load all methods and classes required
│ └── generator.py
│
├── utils/ # Cleaning and processing functions
│
├── notebooks/
│
└── generate_data.py # Controls the flow of the process
The input data is generated by the data extraction repo here, which retrieves the reviews, profiles and products datasets from Amazon. The expected formats are listed below (reviews, then profiles, then products); the files should be placed in the ./data/raw folder. A short loading sketch follows the three tables.
Reviews dataset:

stars | profile_name | profile_link | profile_image | title | date | style | verified | comment | voting | review_images | ASIN |
---|---|---|---|---|---|---|---|---|---|---|---|
5.0 out of 5 stars | Willow | /gp/profile/amzn1.account.AHR5T6MM2O3EPWKQS2TBOVXBXLQA | https://images-na.ssl-images-amazon.com/images/S/amazon-avatars-global/e783dd3d-..-SX48_.jpg | Love love love | Reviewed in the United States on December 11, 2019 | Size: 1.7 Ounce | Verified Purchase | Love, love, love this moisturizer! As a woman who has... | One person found this helpful | 0 | B01M09QQI0 |
Profiles dataset:

json_data | acc_num | name | occupation | location | description | badges | ranking |
---|---|---|---|---|---|---|---|
{'marketplaceId': xxx, 'locale': xxx, 'contributions': [{'id': ....}] } | AE25VHDU4KBY2EIVKGZCEG2A | JD Hart | Retired | USA | fitness instructor, writer | null | 887548 |
Products dataset:

asin | description | price | rating | availability |
---|---|---|---|---|
B07Z9WFWZ8 | RADIANT SERUM FOUNDATION: This carefully formulated foundation for mature skin ... | $11.97 | 4.3 out of 5 | Only 1 left in stock - order soon. |
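As a quick sanity check after extraction, the raw files can be inspected with pandas. This is only an illustrative sketch; it assumes the file names shown in the directory tree above and is not part of the pipeline itself.

```python
# Illustrative only: load the three raw CSVs and confirm the expected columns.
import pandas as pd

reviews = pd.read_csv("data/raw/consolidated_products.csv")       # reviews dataset
profiles = pd.read_csv("data/raw/consolidated_profiles.csv")      # profiles dataset
products = pd.read_csv("data/raw/consolidated_product_info.csv")  # products dataset

print(reviews.columns.tolist())   # stars, profile_name, ..., ASIN
print(profiles.columns.tolist())  # json_data, acc_num, ..., ranking
print(products.columns.tolist())  # asin, description, price, rating, availability
```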
On top of the existing columns, 13 new columns are added to the reviews dataset:
Column | Description |
---|---|
cleaned_profile_link | The URL of the reviewer's profile, which can be used to scrape our Profiles dataset. |
cleaned_title | The review title after our data preprocessing steps. |
decoded_comment | Reviews are decoded and normalized after scraping so that accented characters are stripped, ensuring that words like "naïve" are simply interpreted as (and therefore not differentiated from) "naive". |
cleaned_ratings | Ratings come in the format x.0 out of 5 stars, where x is a digit between 1 and 5. This column rescales the rating to a value between 0 and 1; for example, a rating of 4.0 out of 5.0 stars becomes 0.8 (4 divided by 5). See the sketch after this table. |
cleaned_verified | Verified reviews are assigned 1; otherwise 0. |
acc_num | The account number of the reviewer who posted the review. |
cleaned_voting | Number of helpful votes the review has received. |
cleaned_location | Location from which the reviewer posted the review. |
cleaned_date_posted | Date on which the review was posted. |
poly_obj | Language object detected by Polyglot. |
language | Language detected by Polyglot. |
language_confidence | Confidence score of the language detected by Polyglot. |
cleaned_text | The review text after our data preprocessing steps have been applied to the decoded_comment column. |
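The exact implementation lives in preprocess/preprocessor.py; the sketch below only illustrates the two transformations referenced most often above (accent stripping for decoded_comment and rating rescaling for cleaned_ratings). The helper names are hypothetical, not taken from the repo.

```python
# Hedged sketch of the documented behaviour; not the repo's actual code.
import re
import unicodedata

def strip_accents(text: str) -> str:
    """Reduce accented characters to their ASCII base, e.g. 'naïve' -> 'naive'."""
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")

def clean_rating(raw: str) -> float:
    """Map a rating like '4.0 out of 5.0 stars' to the 0-1 range (4 / 5 = 0.8)."""
    match = re.search(r"(\d+(?:\.\d+)?)\s+out of\s+(\d+(?:\.\d+)?)", raw)
    return float(match.group(1)) / float(match.group(2)) if match else float("nan")

assert strip_accents("naïve") == "naive"
assert clean_rating("4.0 out of 5.0 stars") == 0.8
```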
On top of the existing columns, 5 new columns are added to the profiles dataset:
Column | Description |
---|---|
cleaned_deleted_status | If the value is 1, the account has been deleted; otherwise the account still exists. |
reviewer_contributions | A list of the reviewer's review history. |
marketplace_id | The marketplace ID of the platform (amazon.com/amazon.uk/amazon.jp) where the reviewer created their account. These fields come from the raw json_data column (see the sketch after this table). |
locale | The location of the platform (amazon.com/amazon.uk/amazon.jp) where the reviewer created their account. |
cleaned_ranking | The ranking of the reviewer. |
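The profile fields above are derived from the raw json_data column shown in the sample row earlier. Below is a hedged sketch of how that extraction could look; the actual parsing is done by the pipeline, and the use of ast.literal_eval is an assumption based on the Python-style dict in the sample.

```python
# Hedged sketch: pull marketplace_id, locale and reviewer_contributions out of json_data.
import ast
import pandas as pd

profiles = pd.read_csv("data/raw/consolidated_profiles.csv")

def parse_profile(raw: str) -> pd.Series:
    # Assumes the column holds a Python-style dict literal, as in the sample row.
    data = ast.literal_eval(raw) if isinstance(raw, str) else {}
    return pd.Series({
        "marketplace_id": data.get("marketplaceId"),
        "locale": data.get("locale"),
        "reviewer_contributions": data.get("contributions", []),
    })

profiles = profiles.join(profiles["json_data"].apply(parse_profile))
```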
On top of the existing columns, 4 new columns are added to the products dataset:
Column | Description |
---|---|
decoded_comment | Descriptions may contain non-ASCII characters after scraping; they are decoded and normalized so that accented characters are stripped, ensuring that words like "naïve" are simply interpreted as (and therefore not differentiated from) "naive". |
cleaned_rating | Ratings come in the format x.x out of 5. This column rescales the rating to a value between 0 and 1; for example, a product rated 4.0 out of 5 stars becomes 0.8 (4 divided by 5). |
cleaned_price | Minimum price of the product (see the sketch after this table). |
cleaned_text | The product description after our data preprocessing steps have been applied to the decoded_comment column. |
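A hedged sketch of the cleaned_price transformation follows; the assumption that a raw price field may hold a range (hence taking the minimum) is ours, and the helper name is hypothetical.

```python
# Hedged sketch: strip currency symbols and keep the minimum value if a range is given.
import re

def clean_price(raw: str) -> float:
    values = [float(v) for v in re.findall(r"\d+(?:\.\d+)?", raw or "")]
    return min(values) if values else float("nan")

assert clean_price("$11.97") == 11.97
assert clean_price("$11.97 - $15.99") == 11.97
```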
Give the installation script execute permissions and run it:
chmod +x installation/install.sh
installation/install.sh
If you are on Windows, you may experience difficulties installing polyglot. Follow these steps to install it instead:
https://github.com/aboSamoor/polyglot/issues/127#issuecomment-492604421
Activate the conda environment:
conda activate dvc
To view the list of available arguments:
python generate_data.py -h
Specify your data paths in configs/config.json (a hypothetical example is sketched after the command below) and generate all datasets (reviews, profiles, products):
python generate_data.py -d all
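The actual keys in configs/config.json are defined by the repo and may differ; the snippet below is only a hypothetical illustration of what "specify your data paths" could look like.

```python
# Hypothetical example only -- key names are assumptions, not taken from the repo.
import json

example_config = {
    "raw_data_dir": "data/raw",                  # where the extracted CSVs live
    "base_data_dir": "data/base",                # where the generated base dataset is written
    "preprocessing_dir": "data/preprocessing",   # contractions.txt, slangs.txt
}

print(json.dumps(example_config, indent=2))
```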
After the datasets have been generated, we can register the changes with DVC:
dvc add data/raw data/base
git add data/raw.dvc data/base.dvc
git commit -m "Generated base dataset"
dvc push
git push origin master
To access this version of the dataset outside of this project (requires access to the remote server instance-janellah):
dvc get https://github.com/junronglau/base-data-generator data
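If you have cloned this repo directly, the tracked data can also be retrieved in place with dvc pull (standard DVC usage; this likewise assumes access to the remote):

```
git clone https://github.com/junronglau/base-data-generator
cd base-data-generator
dvc pull
```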
- Add script to pull latest set of data from ryan's database
- Tag a release + changelog after confirmation with hongliang