Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Separate refactoring from #12276 in a prior PR #12296

Merged
merged 2 commits into from
Aug 27, 2018

Conversation

larroy
Copy link
Contributor

@larroy larroy commented Aug 22, 2018

Description

This is a prior refactor separated from #12276 at request from the reviewer.

Fixes --no-dockerhub-cache options and docker's --cache-from so local cache can be used on request by providing --no-dockerhub-cache.

Checked with mypy and pycharm code analyze.

@KellenSunderland

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

Copy link
Contributor

@lebeg lebeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good change

ci/build.py Outdated
each retry
:type backoff: int
"""
import time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to do the imports at the top of the file in this case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it better? it's perfectly legal to do this, and some consider best practice to scope imports to the minimum scope necessary, like for example function scope.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But would it be imported every time the retry function gets called?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel strongly about this in either direction, for some functions like this is better so it's self contained. I can also move this function to util.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see. Thanks!

ci/build.py Outdated
if args.build_only:
continue
build_platform = "build_{}".format(platform)
cmd = ["/work/mxnet/ci/docker/runtime_functions.sh", build_platform]
git_cleanup()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good, but maybe we should issue a warning here with confirmation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, how this would work when all platforms are build one after another? Only the latest artifacts survive?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would this work if for example I want to test a new file locally that hasn't yet been added to git?

ci/build.py Outdated
build_platform = "build_{}".format(platform)
plat_buildir = os.path.abspath(os.path.join(get_mxnet_root(), '..',
"mxnet_{}".format(build_platform)))
if os.path.exists(plat_buildir):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would the path exist if git_cleanup was called before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is addressed, I removed git cleanup.

ci/build.py Outdated
@@ -383,7 +472,8 @@ def use_cache():

./build.py -a

Builds for all platforms and leaves artifacts in build_<platform>
Builds for all platforms and leaves artifacts in build_<platform>. **WARNING** it performs git
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a -y for quiet confirmation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed that git cleanup.

ci/build.py Outdated
@@ -284,7 +364,7 @@ def script_name() -> str:
default=1,
type=int)

parser.add_argument("-c", "--cache", action="store_true",
parser.add_argument("--no-cache", action="store_true",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the background for this change? I'm a little concerned that changing script args could have downstream impact so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree that this is changing behavior of the script and does not fall under "minor refactorings". Even if it's a oneliner - it needs a separate PR.

Copy link
Contributor Author

@larroy larroy Aug 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will split it again then. But this doesn't affect CI given the command line used in the Jenkinsfile.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's too much trouble to split, as agreed in 1:1 I improved documentation about this flag.

ci/build.py Outdated
if args.build_only:
continue
build_platform = "build_{}".format(platform)
cmd = ["/work/mxnet/ci/docker/runtime_functions.sh", build_platform]
git_cleanup()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would this work if for example I want to test a new file locally that hasn't yet been added to git?

ci/build.py Outdated

command = list(chain(*args.command))
docker_binary = get_docker_binary(args.nvidiadocker)
shared_memory_size = args.shared_memory_size
num_docker_build_retires = args.docker_build_retries

if args.list:
list_platforms()
print(list_platforms())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You used logger before, maybe use it here as well instead of print?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is different concerns, this prints unconditionally to stdout which is what you want. Logging can be supressed. This is intentional

ci/build.py Outdated
@@ -284,7 +364,7 @@ def script_name() -> str:
default=1,
type=int)

parser.add_argument("-c", "--cache", action="store_true",
parser.add_argument("--no-cache", action="store_true",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree that this is changing behavior of the script and does not fall under "minor refactorings". Even if it's a oneliner - it needs a separate PR.

@larroy larroy force-pushed the build_py branch 2 times, most recently from c513740 to bc5d32b Compare August 23, 2018 13:19
@larroy larroy changed the title Separate minor refactoring from #12276 in a prior PR Separate refactoring from #12276 in a prior PR Aug 23, 2018
@larroy larroy force-pushed the build_py branch 2 times, most recently from dc6b876 to 2cea66c Compare August 23, 2018 13:37
ci/build.py Outdated
@@ -284,8 +288,9 @@ def script_name() -> str:
default=1,
type=int)

parser.add_argument("-c", "--cache", action="store_true",
help="Enable docker registry cache")
parser.add_argument("--no-cache", action="store_true",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the name a bit confusing, maybe --dockerhub--cache?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚡️

@larroy larroy force-pushed the build_py branch 2 times, most recently from 3b6e8aa to 04b8cd8 Compare August 23, 2018 13:42
@@ -284,8 +288,10 @@ def script_name() -> str:
default=1,
type=int)

parser.add_argument("-c", "--cache", action="store_true",
help="Enable docker registry cache")
parser.add_argument("-c", "--no-dockerhub-cache", action="store_true",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this the same as not providing --docker-registry?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather have it explicit. but we would have to code it if we want it the way you say.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that for "default" behaviour, which I assume is use local cache (as for normal docker builds) we need to provide --no-dockerhub-cache explicitly. I think it should be the other way around.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-c is not a good shortening anymore for the option

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default behavior remains unchanged if I'm not mistaken. Which is to use the dockerhub cache.

Copy link
Contributor Author

@larroy larroy Aug 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kellen. One doesn't need to specify the registry. Hence this option to disable cache which is enabled by default. I'm not sure what's the cause of confusion here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a work around for broken behaviour to me. Would you mind if we leave it out of this PR, and try to think about how to fix the root cause of the problem? To me it makes sense to not use remote caching if a user doesn't specify '--docker-registry'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That’s bad because we want to use cache by default. This was discussed at length with Marco before and we came to agree on this behavior. I wouldn’t want to change that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That’s bad because we want to use cache by default. This was discussed at length with Marco before and we came to agree on this behavior. I wouldn’t want to change that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that's the information I was missing here. In this case it makes sense to me.

if instance_id:
logging.info("EC2 Instance id: %s", instance_id)
pp = pprint.PrettyPrinter(indent=4)
logging.debug("Build environment: %s", pp.pformat(dict(os.environ)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be info? The global level is at that level. Or should we have separate levels for ci and local?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's too verbose. In general I don't think we need to view this environment. I had it to diagnose. I wouldn't bother right now.

logging.getLogger().setLevel(logging.INFO)
def log_environment():
instance_id = ec2_instance_id_hostname()
if instance_id:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this always true? The function returns '?' if something goes wrong

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty string is false. In that case is not printed. For example outside of ci or locally.

try:
return f(*args, **kwargs)
except target_exception as e:
logging.warning("Exception: %s, Retrying in %d seconds...", str(e), mdelay)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we propagate the exception if we out of retries?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last call to f is propagating the exception as you say.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right

Copy link
Contributor

@KellenSunderland KellenSunderland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo the new caching arguments, which I'm not still 100% understanding.

"-f", get_dockerfile(platform),
"--build-arg", "USER_ID={}".format(os.getuid()),
"--build-arg", "GROUP_ID={}".format(os.getgid())]
if use_cache:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I find this a bit of a misnomer. It sounds like we're disabling docker-cache, which is possible, but it's not what's done here. Reading the code I would expect it to pass --no-cache to the docker cmd being run. Maybe a more specific name could help?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open. Any suggestions?

cmd.extend(["--cache-from", tag])
cmd.extend(["-t", tag, get_dockerfiles_path()])

@retry(subprocess.CalledProcessError, tries=num_retries)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really like this.

if not dry_run and interactive:
into_cmd = deepcopy(runlist)
# -ti can't be after the tag, as is interpreted as a command so hook it up after the -u argument
idx = into_cmd.index('-u') + 2
into_cmd[idx:idx] = ['-ti']
cmd = '\\\n\t'.join(into_cmd)
cmd = ' \\\n\t'.join(into_cmd)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this.

@sandeep-krishnamurthy
Copy link
Contributor

@lebeg - This PR is ready to go, if your concerns are addressed.

Copy link
Contributor

@lebeg lebeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, lgtm

@sandeep-krishnamurthy
Copy link
Contributor

@marcoabreu - can you take a look at this?

@sandeep-krishnamurthy sandeep-krishnamurthy added Build pr-awaiting-review PR is waiting for code review labels Aug 24, 2018
@marcoabreu
Copy link
Contributor

LGTM, thanks for iterating on this one!

@marcoabreu marcoabreu merged commit 6a7bfe9 into apache:master Aug 27, 2018
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request Sep 19, 2018
* Separate minor refactoring from apache#12276 in a prior PR

* Address CR comments
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Build pr-awaiting-review PR is waiting for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants