
Data engineering class project: finding the top k words


ZCai25/de_hw_top_k_words


Finding the top k words in a large dataset

The goal of this project is to write code from scratch that finds the k most frequent words in a text file as large as 16 GB.
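As a point of reference for what "top k words" means here, the following is a minimal sketch, not the notebook's actual code. It assumes words are runs of lowercase ASCII letters and apostrophes, and it uses the standard library's `Counter` and `heapq` instead of hand-written data structures. The names `count_words` and `top_k` are hypothetical.

```python
import heapq
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters or apostrophes, case-insensitive.
WORD_RE = re.compile(r"[a-z']+")


def count_words(text):
    """Return a Counter mapping each lowercase word in text to its frequency."""
    return Counter(WORD_RE.findall(text.lower()))


def top_k(counts, k):
    """Return the k (word, count) pairs with the highest counts.

    Ties are broken alphabetically so the result is deterministic.
    """
    return heapq.nsmallest(k, counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Selecting with a heap avoids sorting the whole vocabulary when k is small relative to the number of distinct words.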

Because the data file is as large as my laptop's memory, simple approaches such as opening the file and reading it all at once will not work.

Instead, the file is read in chunks of 1 GB, and multiprocessing is used to speed up the processing.
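The chunked, multi-process approach can be sketched as follows. This is an illustration under stated assumptions rather than the code in the notebook: the file is assumed to be UTF-8 text with line breaks, each chunk is extended to the next line break so no word is split across chunks, and every worker opens the file itself and reads only its own byte range. The function names `chunk_offsets`, `count_chunk`, and `count_file` are hypothetical.

```python
import os
import re
from collections import Counter
from multiprocessing import Pool

# Assumption: a "word" is a run of ASCII letters or apostrophes, case-insensitive.
WORD_RE = re.compile(r"[a-z']+")


def chunk_offsets(path, chunk_size):
    """Yield (start, end) byte ranges of about chunk_size bytes.

    Each range ends just after a line break (or at end of file), so a word
    is never split between two chunks.
    """
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        start = 0
        while start < file_size:
            f.seek(min(start + chunk_size, file_size))
            f.readline()  # advance to the end of the current line
            end = f.tell()
            yield start, end
            start = end


def count_chunk(task):
    """Count the words in one byte range of the file."""
    path, start, end = task
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read(end - start)
    # Assumption: the file is UTF-8; undecodable bytes are replaced.
    text = data.decode("utf-8", errors="replace").lower()
    return Counter(WORD_RE.findall(text))


def count_file(path, chunk_size=1 << 30, processes=None):
    """Count words across the whole file, one worker task per chunk."""
    tasks = [(path, start, end) for start, end in chunk_offsets(path, chunk_size)]
    total = Counter()
    with Pool(processes) as pool:
        # Merge partial counts as workers finish, in any order.
        for partial in pool.imap_unordered(count_chunk, tasks):
            total.update(partial)
    return total
```

A call such as `count_file("small_50MB_dataset.txt")` should sit under an `if __name__ == "__main__":` guard, which `multiprocessing` requires on platforms that start workers by spawning a new interpreter. Peak memory is roughly the chunk size times the number of worker processes, so on a machine whose RAM matches the file size, either the chunk size or the process count needs to be kept small.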

The code is in Word_freq.ipynb. The stop words used in this project are in the stop_words directory. A sample dataset, small_50MB_dataset.txt, is included as an example.
