Conversation

@gouenji-shuuya

  • Move to bash as that was implicitly expected (ref. #vidyut on Discord)
  • Some refactoring.
  • Sub-make is correctly called when using make create_all_data.
  • Use -j`nproc` in make.
  • Ignore venv in git.

@gouenji-shuuya gouenji-shuuya force-pushed the make-fix branch 2 times, most recently from f38da63 to df9c149 on March 12, 2023 at 12:31
@akprasad
Contributor

Thank you for this! Forgive the late look.

@akprasad akprasad left a comment

Thank you very much!

$ git clone https://github.com/ambuda-org/vidyut.git
$ cd vidyut
$ make test
$ make -j`nproc` test
Contributor

My understanding is that -j controls the number of make recipes that run in parallel. If so, what is the benefit of using -j here?

Author

Cargo can respect make's jobserver settings; see rust-lang/rust#42682.

Even though the workflow is currently serial, decoupling its steps in the future could make builds faster.
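To make the jobserver point concrete, here is a minimal sketch (the `-C` directory and target names are hypothetical, not from this repo). GNU make only shares its job slots with a child make that is invoked through the `$(MAKE)` variable, which is why correctly calling the sub-make matters:

```make
# Hypothetical sketch: invoking the child via $(MAKE), not a bare `make`,
# lets GNU make pass its jobserver through MAKEFLAGS. With
# `make -j$(nproc) create_all_data`, the sub-make (and cargo, which can
# read the jobserver from its environment) then shares one pool of slots.
create_all_data:
	$(MAKE) -C vidyut-data all
```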

For instance, create_kosha executes successfully before create_sandhi_rules, so there is no dependency between them (though the latter is quick). I think the cloning can also be parallelized by using recipes like this:

mkdir build

get_corpus_data:
	@if [[ -e "data/raw/dcs" ]]; then 				\
		echo "Training data already exists -- skipping fetch.";	\
	else 								\
		echo "Training data does not exist -- fetching.";	\
		mkdir -p "data/raw/dcs";				\
		git clone https://github.com/OliverHellwig/sanskrit.git	\
			--depth=1 build/dcs-data;			\
		mv build/dcs-data/dcs/data/conllu data/raw/dcs/conllu;	\
	fi


get_linguistic_data:
	@if [[ -e "data/raw/lex" ]]; then 				\
		echo "Lexical data already exists -- skipping fetch.";	\
	else 								\
		echo "Lexical data does not exist -- fetching.";	\
		mkdir -p "data/raw/lex";				\
		git clone https://github.com/sanskrit/data.git		\
			--depth=1 build/data-git;			\
		python3 build/data-git/bin/make_data.py 		\
			--make_prefixed_verbals;			\
		mv build/data-git/all-data/* data/raw/lex;		\
	fi

So the create_all_data.sh file really is the bottleneck: it runs several make commands one by one. Maybe a bit of parallelization would help, as with test/train?
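The parallelization idea above can be sketched as a Makefile fragment (a hedged sketch: only create_kosha and create_sandhi_rules are named in the discussion; the aggregate target wiring is assumed):

```make
# Hypothetical sketch: listing the independent steps as prerequisites of
# one aggregate target, with no ordering declared between them, lets
# `make -j$(nproc) create_all_data` run them concurrently instead of
# one by one as a serial shell script does.
create_all_data: get_corpus_data get_linguistic_data create_kosha create_sandhi_rules
```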

@akprasad
Contributor

Sorry for not getting to this earlier!

Given the repo's evolution, there are some conflicts now, so I'll close the PR for the sake of housekeeping. If you're still interested, please resolve the conflicts and reopen the PR.

@akprasad akprasad closed this Oct 15, 2025