Finding the top k words in a large dataset

The goal of this project is to write code from scratch that finds the k most frequent words in a text file as large as 16 GB.

Because the data file is as large as my laptop's memory, simple approaches such as reading the whole file at once will not work.

Instead, the file is read in chunks of 1 GB, and multiprocessing is used to speed up the process.
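The chunked, multi-process approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebook: the function names (chunk_offsets, count_chunk, count_words), the word pattern, and the default of 4 worker processes are all assumptions made for the example. Each chunk is extended to the next newline so that no word is split between two workers.

```python
import os
import re
from collections import Counter
from multiprocessing import Pool

# Assumed tokenization: lowercase ASCII letters and apostrophes.
WORD_RE = re.compile(r"[a-z']+")


def chunk_offsets(path, chunk_size):
    """Return (start, end) byte ranges that each end on a line boundary."""
    file_size = os.path.getsize(path)
    offsets = []
    with open(path, "rb") as f:
        start = 0
        while start < file_size:
            f.seek(min(start + chunk_size, file_size))
            f.readline()  # extend the chunk to the end of the current line
            end = f.tell()
            offsets.append((start, end))
            start = end
    return offsets


def count_chunk(task):
    """Count the words in one byte range of the file."""
    path, start, end = task
    with open(path, "rb") as f:
        f.seek(start)
        text = f.read(end - start).decode("utf-8", errors="ignore").lower()
    return Counter(WORD_RE.findall(text))


def count_words(path, chunk_size=1 << 30, processes=4):
    """Count words across the whole file, one chunk per worker task."""
    tasks = [(path, start, end) for start, end in chunk_offsets(path, chunk_size)]
    total = Counter()
    with Pool(processes) as pool:
        # Merge partial counts as soon as each worker finishes.
        for partial in pool.imap_unordered(count_chunk, tasks):
            total.update(partial)
    return total
```

On platforms that start worker processes with spawn (macOS, Windows), count_words should be called from inside an if __name__ == "__main__": guard. Several workers each holding a 1 GB chunk plus its decoded copy can use a lot of memory, so the chunk size and process count would likely need tuning for a 16 GB machine.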

The code can be found in Word_freq.ipynb. The stop words used in this project are in the stop_words directory. A sample dataset is included as an example: small_50MB_dataset.txt
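As an illustration of how stop words and the final top-k selection might fit together, here is a small sketch. It assumes the stop_words directory holds plain .txt files with one word per line, which this README does not confirm, and the helper names load_stop_words and top_k are hypothetical. It uses heapq.nlargest from the standard library, which may differ from the notebook's own selection logic.

```python
import heapq
from collections import Counter
from pathlib import Path


def load_stop_words(directory):
    """Collect stop words from every .txt file in `directory`.

    Assumed file format: one word per line.
    """
    stop_words = set()
    for file in Path(directory).glob("*.txt"):
        with open(file, encoding="utf-8") as f:
            stop_words.update(line.strip().lower() for line in f if line.strip())
    return stop_words


def top_k(counts, k, stop_words=frozenset()):
    """Return the k most frequent (word, count) pairs, skipping stop words."""
    candidates = ((w, c) for w, c in counts.items() if w not in stop_words)
    return heapq.nlargest(k, candidates, key=lambda item: item[1])
```

Given a Counter of word frequencies, top_k(counts, 10, load_stop_words("stop_words")) would return the ten most frequent non-stop words. Using a heap keeps the selection at roughly O(n log k) rather than sorting the full vocabulary.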

ZCai25/de_hw_top_k_words
