Skip to content

Conversation

@adoroszlai
Copy link
Contributor

What changes were proposed in this pull request?

There are a few slightly different sample docker compose environments: ozone, ozoneperf, ozones3, ozone-recon. This change proposes to merge these 4 by minor additions to ozoneperf:

  1. add recon service from ozone-recon
  2. run GDPR and S3 tests
  3. expose datanode web port (eg. for profiling)

Plus: also run ozone-shell test (from basic suite), which is currently run only in ozonesecure

I also propose to rename ozoneperf to ozone for simplicity.

Consolidating these 4 environments would slightly reduce both code duplication and the time needed for acceptance tests.

https://issues.apache.org/jira/browse/HDDS-2588

How was this patch tested?

Ran acceptance test in ozone dir. Generated keys using freon, verified that Jaeger, Prometheus, Grafana reflect the operations.

Clean CI in private branch.

OZONE-SITE.XML_ozone.metadata.dirs=/data/metadata
OZONE-SITE.XML_ozone.handler.type=distributed
OZONE-SITE.XML_ozone.recon.db.dir=/data/metadata/recon
OZONE-SITE.XML_ozone.recon.om.db.dir=/data/metadata/recon
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/data/metadata/reon -> /data/metadata/om

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bharatviswa504 for spotting this. These config values from ozone-recon env as they were.

@avijayanhwx @swagle can Recon use the same directory for both ozone.recon.db.dir and ozone.recon.om.db.dir?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adoroszlai Yes the same directory can be used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @avijayanhwx. Then I think we can keep it as is.

@bharatviswa504
Copy link
Contributor

bharatviswa504 commented Nov 21, 2019

Good idea, overall LGTM. One minor comment.
One more suggestion, In a similar way, we can merge ozone-om-ha and ozone-om-ha-s3 also.

@adoroszlai
Copy link
Contributor Author

Thanks @bharatviswa504 for the review.

In a similar way, we can merge ozone-om-ha and ozone-om-ha-s3 also.

Unfortunately cannot do that, since ozone-om-ha is very different (custom docker image with SSH, starts/stops OM manually, etc.).

@elek
Copy link
Member

elek commented Nov 21, 2019

Thanks to work on this @adoroszlai . I always felt the same, that we have too many environments, but I didn't know the right answer. therefore I just share my thoughts this is not a 👍 or 👎 as I don't know which one is the better.

1.THE PHILOSOPHY

It's -- at least partially -- a philosophical question, what is Ozone. (And as it's philosophy, I am interested about the opinion of our philosopher of Ozone cc @anuengineer)

By default I would say:

  • ozone = scm/om/datanode + s3g + recon
  • ozone-perf = ozone + all the tools to monitor

But we can also say what what this patch says:

  • ozone: the full ozone experience (scm/om/datanode + s3g + recon + all the monitoring tools).

(at lest for Apache Ozone, this is not true for proprietary distributions).

  1. THE USABILITY

The other question is usability: Do we need Prometheus and Jaeger all the time. Do we need to start them when we would like to test any of the features of ozone?

Prometheus: maybe
Jaeger: I am not sure.

  1. THE GUARANTEE

My third (actual) problem is the guarantee of a cluster. To avoid same flakiness I would like to declare that we need at least 3 datanodes for ozone compose clusters (see. HDDS-2606). But I am not happy with the usability that Ozone can't work if you don't do a docker-compose scale datanode=3

We may need a smaller subset of ozone (ozone-core) where the three datanodes are not required (but not used to run all the different acceptance testing). In this case the usability problem (2) can be solved as we can have a leaner ozone cluster to test locally.

@adoroszlai
Copy link
Contributor Author

Thanks @elek for your thoughts. I think (1) and (2) are addressed by the followup commit, which extracts monitoring and profiling into separate configs. (Let me know if any of the configs are miscategorized.) They can be mixed in as desired by one of:

# no COMPOSE_FILE var                                                  # => only Ozone
export COMPOSE_FILE=docker-compose.yaml:monitoring.yaml                # => add monitoring
export COMPOSE_FILE=docker-compose.yaml:profiling.yaml                 # => add profiling
export COMPOSE_FILE=docker-compose.yaml:monitoring.yaml:profiling.yaml # => add both

I need to think (or talk) about your third point (3) more.

@elek
Copy link
Member

elek commented Nov 28, 2019

I think (1) and (2) are addressed by the followup commit, which extracts monitoring and profiling into separate configs

Thanks the update @adoroszlai This approach is very smart, but I have some fear how easy is to understand it. (One additional function of the compose folders to provide simple examples to use ozone.)

But let's try out this approach. I am fine with it.

Can you please update the README.txt inside compose/ozone (currently it's the original ozoneperf readme, It can be simplified but we need to add information about the COMPOSE_FILE=... trick)?

@adoroszlai
Copy link
Contributor Author

Thanks for the feedback @elek.

Can you please update the README.txt

Sure, will do, but didn't want to write doc until the code is OK-ed. ;)

@adoroszlai
Copy link
Contributor Author

Changes in last two commits besides README update:

  1. fix freon-*.yaml:
    • versions need to match other yamls
    • monitoring config is required for getting Freon spans
  2. set safemode.min.datanode and ozone.replication to value of OZONE_REPLICATION_FACTOR environment variable
  3. add run.sh to make startup simpler (--scale datanodes based on OZONE_REPLICATION_FACTOR)

So docker-compose up without --scale still starts a cluster with 1 datanode, but ozone sh key put works OK, whereas previously it required explicit -r ONE. And getting a 3-datanode cluster is as simple as OZONE_REPLICATION_FACTOR=3 ./run.sh.

Copy link
Member

@elek elek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @adoroszlai the update.

I am happy with this approach with setting the scale with OZONE_REPLICATION_FACTOR but wouldn't it be more clean to do it inside #282 / HDDS-2646 for all the environments together?

ports:
- 9876:9876
environment:
ENSURE_SCM_INITIALIZED: /data/metadata/scm/current/VERSION
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it better to move out this to a separated file? I think it's easier to overview if it's in this file. Do we really need a separated file to store this one line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hated moving it out, but the following considerations together made me do it:

  1. Variable substitution only works in the yaml files, not in configs passed via env_file. So safemode.min.datanode and ozone.replication need to be in environment.
  2. When merging common-config, one complete dict overrides the other: depending on the order either common-config or the specific service gets to define environment.

So I moved out these one-liners to avoid duplicating the "two-liners". Plus it seems unlikely that anyone ever wants to change these infrastructure related variables.

Now another alternative occurred to me: we might define a separate dict for the configs to be merged into environment. Let me experiment a bit with this. If it works, we could avoid the separate config files.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer the separated common configs. Especially as we have only a few lines of settings they can be included in the common configs all together. But it's not a blocker for now, we can commit it (thanks to explain the reason behind the small files...)

Copy link
Contributor Author

@adoroszlai adoroszlai Dec 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ae2bc55 moves these one-liners back to docker-compose.yaml and the separate files are no longer needed.


declare -ix OZONE_REPLICATION_FACTOR
: ${OZONE_REPLICATION_FACTOR:=1}
docker-compose up --scale datanode=${OZONE_REPLICATION_FACTOR} --no-recreate "$@"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please help me to understand why do we need --no-recreate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without --no-recreate, if I run docker-compose up a second time to start Freon after cluster is up, Docker Compose may recreate all existing containers. I noticed that this happens when running the second command from another terminal, while following logs in the original terminal. Not sure if this is really the cause, or what other conditions could trigger it. Using the flag helps avoid this situation.

Do you think it may cause problems in other cases?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, while I agree the call for re-create is correct here. @adoroszlai You did ask the question. The issue can happen if we have a cluster that was running that has error-ed out. Then re-running this command will not reset the system. But it is probably something that we can live with, or fix much later. I predict for a long time, when someone reports an issue, we will say " make sure you kill all running docker containers". But then, traditionally that is our first debugging step whenever someone tells us that docker based stuff is not stable for them.

Just to make sure, I am +1 and ok with this change. Just responding to your question; that is all.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my experience the docker-compose up worked well even from other terminal if nothing has been changed and the docker-compose file set was the same.

Can we start the scm first with docker-compose up -d scm and after everything else with docker-compose up -d with this no-recreate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if nothing has been changed and the docker-compose file set was the same.

But the readme says freon compose file should be added only when datanodes are up, so the set is not the same.

https://github.com/apache/hadoop-ozone/blob/76ad638b47232761a1732281188162e5c31308d8/hadoop-ozone/dist/src/main/compose/ozoneperf/README.md#L47-L51

@adoroszlai
Copy link
Contributor Author

I am happy with this approach with setting the scale with OZONE_REPLICATION_FACTOR but wouldn't it be more clean to do it inside #282 / HDDS-2646 for all the environments together?

I'm fine with reworking this one to accommodate whatever changes are done in #282. I wanted to avoid blocking the acceptance test fix for a usability improvement.

@anuengineer
Copy link
Contributor

It's -- at least partially -- a philosophical question, what is Ozone. (And as it's philosophy, I am interested about the opinion of our philosopher of Ozone cc @anuengineer)

I am okay with what this patch says -- in reality, irrespective of what we say, there will be monitoring , tracing and logging collectors in place for most data centers. So irrespective of what we do (Prometheus, Jaeger, Grafana, Fluentd) the system admins will do the right thing for them.

We are just show casing that fact that, it is trivial to do this with Ozone. So when someone is evaluating Ozone, the question of how can I really run this service in production is answered via the presence of these tools. I would go a step ahead add these as recipes in the Ozone documentation too.

Copy link
Member

@elek elek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Let me commit it now.

It's a newer approach and we can improve it in follow-up jiras if needed.

Thanks @bharatviswa504 and @anuengineer the review and @adoroszlai the patch.

@elek elek closed this in e14f709 Dec 12, 2019
@adoroszlai
Copy link
Contributor Author

Thanks @anuengineer, @avijayanhwx, @bharatviswa504 for review, and @elek for review and commit.

@adoroszlai adoroszlai deleted the HDDS-2588 branch December 12, 2019 11:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants