
Kubernetes support? #19

Closed
mperham opened this issue Oct 24, 2017 · 25 comments
Comments

@mperham
Collaborator

mperham commented Oct 24, 2017

What should Faktory look like in a world full of Kubernetes? My understanding is that Kubernetes could be very useful in scaling worker processes as queues grow. How can Faktory make this easy?

@jbielick
Contributor

jbielick commented Oct 24, 2017

The newer implementations of the Horizontal Pod Autoscaler (HPA 1.6+?) support scaling on Custom Metrics (k8s 1.8+ custom metrics), for which there are some preliminary implementations. I think one interesting one is the Prometheus Adapter: if I understand correctly, k8s can pull metrics from Prometheus, and then an HPA can be set to scale based on a metric in that set. Perhaps in this world the responsibility of Faktory would be to have a Prometheus exporter (à la oliver006/redis_exporter) that can send metrics to Prometheus.

Since an exporter might live separately, I believe at a fundamental level the basic necessity would be Faktory's API for exposing processing metrics. An obvious metric for scaling might be the size of the queues, but I have also found myself interested in the amount of time a job spends in the queue before processing as well, because I believe that's also an intelligent indicator of a need to scale up workers.

Are there current ideas / plans for what internal metrics will be gathered / recorded?
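For illustration, here is a rough sketch of what this could look like on the Kubernetes side, assuming a Prometheus adapter already serves a queue-latency metric through the custom metrics API. The metric name `faktory_queue_latency_seconds`, the target value, and the deployment name are all hypothetical:

```yaml
# Hypothetical HPA scaling worker pods on Faktory queue latency.
# Assumes a Prometheus adapter exposes faktory_queue_latency_seconds
# via the custom metrics API; names and thresholds are illustrative only.
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: faktory-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: faktory-worker
  minReplicas: 1
  maxReplicas: 20
  metrics:
    - type: Object
      object:
        target:
          kind: Service
          name: faktory
        metricName: faktory_queue_latency_seconds
        # scale up when jobs wait more than ~30s in the queue
        targetValue: 30
```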

@mperham
Collaborator Author

mperham commented Oct 24, 2017

Yeah, queue size and latency are easily implementable.

What's the best approach to getting Kubernetes aware of Faktory? Are people using Helm? Would a Docker image or DEB/RPM binaries be most useful?

@syndbg

syndbg commented Oct 24, 2017

@mperham A Docker image would definitely enable Faktory to be used everywhere, be that Kubernetes, Mesos Marathon/Aurora, Nomad, etc.

@rosskevin

We run Google Container Engine, a.k.a. GKE (managed Kubernetes), for all our services, which include Sidekiq workers. We have a situation where we may get high flash traffic (PDF rendering), for which we are going to use Google Cloud Functions, a.k.a. GCF (GCP's serverless offering), so we don't have to worry about scale. So Sidekiq handles typical jobs, and GCF handles high-scale/flash-traffic jobs like rendering.

With the introduction of Faktory, I think one approach/architecture for pain-free scaling would be:

  • kubernetes service + deployment for faktory
  • kubernetes deployment for ruby worker (migrate sidekiq jobs)
  • google cloud functions (GCP's answer for serverless) for something like faktory_worker_node_gcf worker (no scale/cluster management needed)

Faktory needs proper probes:

  • readiness - ready for traffic
  • liveness - lightest ping possible
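As a sketch of those two probes, a `tcpSocket` check against the command port is about the lightest ping possible, and the web UI port can serve an HTTP readiness check. The ports below are the defaults mentioned elsewhere in this thread (7419 command, 7420 web UI); the path and timing values are assumptions to tune:

```yaml
# Illustrative probe configuration for a faktory container.
# tcpSocket on 7419 is the lightest possible liveness ping;
# the web UI on 7420 can answer an HTTP readiness check.
containers:
  - name: faktory
    # ...image, command, volumeMounts, etc.
    livenessProbe:
      tcpSocket:
        port: 7419
      initialDelaySeconds: 5
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /
        port: 7420
      initialDelaySeconds: 5
      periodSeconds: 10
```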

Preparing for kubernetes:

  • some people use helm, but I haven't found it very heavily trafficked. Nonetheless, a reference helm config helps others develop their own configs (and can be used as a quickstart)
  • many are using spinnaker for continuous deployment (e.g. based on vanilla kubernetes yaml files)
  • optimized docker image would be most useful - e.g. redis-alpine
  • persistent storage of Faktory itself and ensuring recovery/availability will be the most important concern for most.

Last but not least, consider an equivalent service to redislabs.com. We use GKE because we want to offload as much management as possible to focus on app development. For that reason, we didn't even try to deploy redis inside our cluster, but we used redislabs.com to deploy in the same region as our GKE cluster. Feel free to ping me via email if you want to dig into any of these areas.

@mperham
Collaborator Author

mperham commented Oct 25, 2017

@rosskevin Thanks for the info, that's great detail. Running a Faktory SaaS is not in my immediate plans but like antirez and Redislabs, I've already had pings from people interested in partnering in building something and I'm always willing to chat: mike @ contribsys.com

I'm still trying to wrap my head around what an enterprise-ready Faktory would look like and require; this helps a lot.

@gmaliar

gmaliar commented Oct 27, 2017

@rosskevin @mperham a few notes to add about k8s support, let me know what you think and if I'm getting anything wrong.

contribsys/faktory is the master process that also holds the RocksDB database. Remember that k8s might kill the Faktory pod at will, so there should be a Service with both the web and worker ports open, connected to the faktory-master Deployment.
The Deployment itself should have a PersistentVolumeClaim so that the actual database files live on that volume; otherwise you'd lose all jobs waiting to be fulfilled.
this looks okay

// undocumented on purpose, we don't want people changing these if possible
as it allows you to specify flags in the deployment YAML under the command key, but it might be a better approach to allow everything to be specified as environment variables, i.e. password, port, storage directory, no-tls, etc.

I am a heavy Helm user; it's pretty easy to set up a basic chart that takes care of setting up the Faktory server. But the workers might need to be a separate chart, so that upgrading the worker chart won't require a redeploy of the Faktory server.

let me know if you need any help with PRs around these areas ...

@mickume

mickume commented Dec 1, 2017

I will look into configuration & deployment for OpenShift ... similar to Kubernetes ...

@contribsys contribsys deleted a comment from vamsi248 Dec 1, 2017
@contribsys contribsys deleted a comment from vamsi248 Dec 1, 2017
@mperham
Collaborator Author

mperham commented Dec 1, 2017

Gonna close this because I'm not a fan of nebulous, open-ended issues. Please open a specific issue if there's something Faktory can do to make our k8s support better.

@mperham mperham closed this as completed Dec 1, 2017
@kirs

kirs commented Apr 24, 2019

@mperham I was going to open an issue regarding Kubernetes support, and I discovered you have one already open!

Specifically, I'd like to see examples of Kubernetes YAMLs that people could copy, paste, kubectl apply and have Faktory running.

This would be especially cool to have for zero-downtime deploy of Faktory, as well as for Redis Gateway and replicated Faktory Pro.

@jbielick
Contributor

jbielick commented Apr 24, 2019

This could be a starting point.

Couple of things to note:

  • I don't know if I would consider it production-ready, but it's a stateful set of 1 faktory server (embedded redis). A redis gateway would be a much better idea for production usage (redundancy, failover being proper concerns).
  • It's configured for Google Cloud's Kubernetes offering (GKE)
  • This definition creates, and operates within, a faktory namespace
  • The container limits are super low so you'll want to tweak those: 300m CPU, 1Gi Memory
  • It also depends on a secret being created for the faktory password. You can create it with:

kubectl --namespace=faktory create secret generic faktory --from-literal=password=yoursecurepassword

  • It assumes a persistent disk in us-east1-b is ok (change this to your cluster's zone if necessary)
apiVersion: v1
kind: Namespace
metadata:
  name: faktory
---
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  namespace: faktory
  name: fast
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
  zone: us-east1-b
---
apiVersion: v1
kind: Service
metadata:
  namespace: faktory
  labels:
    name: faktory
  name: faktory
spec:
  ports:
    - name: faktory
      protocol: TCP
      port: 7419
      targetPort: 7419
  selector:
    app: faktory
---
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: faktory
  name: faktory-conf
data:
  cron.toml: |
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  namespace: faktory
  name: faktory
spec:
  serviceName: faktory
  replicas: 1
  template:
    metadata:
      namespace: faktory
      labels:
        app: faktory
    spec:
      terminationGracePeriodSeconds: 10
      containers:
        - name: faktory
          image: contribsys/faktory:0.9.6
          command:
            - /faktory
            - -b
            - :7419
            - -w
            - :7420
            - -e
            - production
          resources:
            requests:
              cpu: 300m
              memory: 1Gi
          env:
            - name: FAKTORY_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: faktory
                  key: password
          ports:
            - containerPort: 7419
              name: faktory
          volumeMounts:
            - name: faktory-data
              mountPath: /var/lib/faktory/db
            - name: faktory-conf
              mountPath: /etc/faktory/conf.d
      volumes:
        - name: faktory-conf
          configMap:
            name: faktory-conf
            items:
              - key: cron.toml
                path: cron.toml
  volumeClaimTemplates:
    - metadata:
        name: faktory-data
        namespace: faktory
        annotations:
          volume.beta.kubernetes.io/storage-class: fast
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 10Gi

A helm chart might be the right call for a more configurable and standardized deployment.

@dm3ch

dm3ch commented May 19, 2019

Here's a helm chart helm/charts#13974

@ecdemis123

ecdemis123 commented Sep 11, 2019

Here are the resources I used to get the Faktory server fully set up. I hope that this can help someone else in the future, as @jbielick's example deployment yml was a great help to me.

A note about this configuration: we are using Datadog deployed as a DaemonSet in Kubernetes, so this setup will allow you to use the "pro metrics" statsd implementation. The references to DD_TRACE_AGENT_HOSTNAME are how we access the Datadog agent.

Configs created:
kubectl create secret generic faktory --from-literal=password=${your_password} --from-literal=username=${your_username} --from-literal=license=${your_license}

kubectl create configmap faktory-config-merged --from-file=cron.toml=path/to/file/cron.toml --from-file=statsd.toml=path/to/file/statsd.toml -o yaml --dry-run=true | kubectl apply -f -

Server Deployment:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: faktory
  labels:
    run: faktory
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 100%
  template:
    metadata:
      labels:
        run: faktory
        name: faktory
      annotations:
        key: value
    spec:
      containers:
        - name: faktory
          image: 'location of pro dockerfile'
          command: [ "/faktory"]
          # must use "production" flag for cron
          args: ["-w", ":7420", "-b", ":7419", "-e", "production"]
          ports:
            - containerPort: 7419
            - containerPort: 7420
          resources:
            requests:
              cpu: 300m
              memory: 1Gi
          env:
            - name: FAKTORY_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: faktory
                  key: password
            - name: FAKTORY_USERNAME
              valueFrom:
                secretKeyRef:
                  name: faktory
                  key: username
            - name: FAKTORY_LICENSE
              valueFrom:
                secretKeyRef:
                  name: faktory
                  key: license
            - name: DD_TRACE_AGENT_HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
          volumeMounts:
            - name: faktory-data
              mountPath: /var/lib/faktory/db
            - name: shared
              mountPath: /etc/faktory/conf.d
            - name: faktory-config-merged
              mountPath: /merged/
      initContainers:
        - name: config-data
          image: busybox
          command:
            - sh
            - -c
            - cp -a /merged/. /shared/ && sed -i 's/localhost/'"$DD_TRACE_AGENT_HOSTNAME"'/g' /shared/statsd.toml && ln -sf /merged/cron.toml /shared/cron.toml
          env:
            - name: DD_TRACE_AGENT_HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
          volumeMounts:
            - name: faktory-config-merged
              mountPath: /merged/
            - name: shared
              mountPath: /shared/
      volumes:
        - name: faktory-config-merged
          configMap:
            name: faktory-config-merged
            items:
              - key: cron.toml
                path: cron.toml
              - key: statsd.toml
                path: statsd.toml
        - name: shared
          emptyDir: {}
        - name: faktory-data
          persistentVolumeClaim:
            claimName: faktory-data
---
kind: Service
apiVersion: v1
metadata:
  name: faktory
  labels:
    run: faktory
spec:
  selector:
    run: faktory
  ports:
  - name: network
    protocol: TCP
    port: 7419
    targetPort: 7419
  - name: webui
    protocol: TCP
    port: 80
    targetPort: 7420

Persistent Volume Claim:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: faktory-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi

This is a script for sending the HUP signal to the Faktory server to reset the cron when new jobs are added:

#!/bin/bash

# This script reloads the configuration of the Faktory job server deployment
# by sending the HUP signal to the Faktory docker container
# https://github.com/contribsys/faktory/wiki/Pro-Cron

echo "Updating Faktory cron config and reloading Faktory server config"

# Resolve the project directory relative to this script's location
PROJECT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"

kubectl create configmap faktory-cron-config --from-file=cron.toml=${PROJECT_DIR}/path/to/file/cron.toml -o yaml --dry-run | kubectl apply -f -

POD=`kubectl get --no-headers=true pods -l name=faktory -o custom-columns=:metadata.name`

if [[  -z "$POD" ]] ; then
    echo "No Faktory Pods Found"
    exit 1
fi

echo "Faktory pod: ${POD}"

kubectl exec -it ${POD} -c=faktory -- /bin/kill -HUP 1

echo "Done"
exit 0

I hope this helps someone else with their setup!

@jbielick
Contributor

@dm3ch I saw that the incubator PR was closed. Did you end up adding your chart / repo to the helm hub? I couldn't find it. Would you be opposed to me using some of the files from your PR to make a chart and publish?

@ecdemis123 this is great. I assume this is a production setup? Glad you figured out the Datadog DaemonSet connection. I'll definitely be referencing that at some point to get it working correctly in one of our clusters. That initContainer is a clever workaround for interpolating the ENV var in the config.

I think I'll start on a helm chart, which could automate sending the SIGHUP (is it not a USR1 signal?) upon changing of a cron config. I'll post here if I have some progress to show. Is baking in some datadog support a good idea? I know Sidekiq does this and I could borrow @ecdemis123's solution there.

@ecdemis123

Cheers! Yep, this is our production setup; our staging setup is pretty much the same, with the exception that it's running Faktory in development mode.

I'll be interested to see the helm chart. We aren't currently using helm but we hope to move onto it someday.

@ecdemis123

Actually, that HUP command is not working to restart the cron. I suspect it's due to kubernetes/kubernetes#50345, since I'm using subPath to mount the conf.d files. Gonna dig more into it and see if I can come up with a workaround.

@ecdemis123

I updated my deployment manifest above with a working version that will accept the HUP signal and update the cron schedule from a k8s ConfigMap. Everything seems to be working well in production now.

@jbielick
Contributor

@ecdemis123 You might consider writing the ConfigMap like so:

apiVersion: v1
kind: ConfigMap
metadata:
  name: faktory
data:
  cron.toml: |
    [[cron]]
      schedule = "*/5 * * * *"
      [cron.job]
        type = "FiveJob"
        queue = "critical"
        [cron.job.custom]
          foo = "bar"

    [[cron]]
      schedule = "12 * * * *"
      [cron.job]
        type = "HourlyReport"
        retry = 3

    [[cron]]
      schedule = "* * * * *"
      [cron.job]
        type = "EveryMinute"
  faktory.toml: ""
  test.toml: ""

Where each key is a file, the value its string contents.

I found this pretty convenient when mounting to the pod:

# ...
            - name: faktory-configs
              mountPath: /etc/faktory/conf.d
# ...
      volumes:
        - name: faktory-configs
          configMap:
            name: faktory
› kubectl exec faktory-0 -- ls -al /etc/faktory/conf.d
total 12
drwxrwxrwx    3 root     root          4096 Sep 14 23:21 .
drwxr-xr-x    1 root     root          4096 Sep 14 23:14 ..
drwxr-xr-x    2 root     root          4096 Sep 14 23:21 ..2019_09_14_23_21_59.171278579
lrwxrwxrwx    1 root     root            31 Sep 14 23:21 ..data -> ..2019_09_14_23_21_59.171278579
lrwxrwxrwx    1 root     root            16 Sep 14 23:14 cron.toml -> ..data/cron.toml
lrwxrwxrwx    1 root     root            19 Sep 14 23:14 faktory.toml -> ..data/faktory.toml
lrwxrwxrwx    1 root     root            16 Sep 14 23:14 test.toml -> ..data/test.toml

Changes to the ConfigMap get pushed to the pod in less than a minute (10s in my case).

As a result, the total delay from the moment when the ConfigMap is updated to the moment when new keys are projected to the pod can be as long as kubelet sync period (1 minute by default) + ttl of ConfigMaps cache (1 minute by default) in kubelet.

https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/#mounted-configmaps-are-updated-automatically

@jbielick
Contributor

You might be able to write the statsd.toml in the init container and it won't get overwritten. That's kind of a tricky one :\

The reloading (kill -HUP 1) seems to work (once the files were fully pushed out). I'm thinking it might be a good idea to have a sidecar container watching those files and when they change, send a signal to the other container in the pod.

I'll do an experiment and report back. This ability is mentioned here: https://kubernetes.io/docs/tasks/configure-pod-container/share-process-namespace/
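As a rough, untested sketch of that sidecar idea: with `shareProcessNamespace` enabled, a second container can see the faktory process and signal it. The poll interval and the use of busybox's `pkill` are assumptions, and the sidecar watches the `..data` symlink that the kubelet atomically swaps when a mounted ConfigMap updates (visible in the `ls -al` output above):

```yaml
# Hypothetical pod spec fragment: a sidecar that HUPs the faktory process
# whenever the mounted ConfigMap changes. shareProcessNamespace lets
# containers in the pod see and signal each other's processes.
spec:
  shareProcessNamespace: true
  containers:
    - name: faktory
      image: contribsys/faktory
      # ...command, ports, and the conf.d volumeMount as elsewhere in this thread
    - name: config-reloader
      image: busybox
      command:
        - sh
        - -c
        - |
          # the kubelet swaps the ..data symlink target on ConfigMap updates
          last="$(readlink /etc/faktory/conf.d/..data)"
          while true; do
            current="$(readlink /etc/faktory/conf.d/..data)"
            if [ "$current" != "$last" ]; then
              pkill -HUP faktory
              last="$current"
            fi
            sleep 10
          done
      volumeMounts:
        - name: faktory-conf
          mountPath: /etc/faktory/conf.d
```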

@ecdemis123

ecdemis123 commented Sep 16, 2019

Interesting. My ConfigMap was created from a file, not using a string, so I wonder if that would simplify the implementation a bit. Mine looks pretty similar to yours, but I'm not familiar enough with ConfigMaps to spot any subtle differences. I like the idea of having a sidecar container send the HUP signal. I added the HUP script referenced above to our deploy pipeline, to be run manually if necessary.

kubectl describe configmap faktory-config-merged
Name:         faktory-config-merged
Namespace:    default
Labels:       <none>
Annotations:  "truncated for now"

Data
====
cron.toml:
----
[[cron]]
  schedule = "12 * * * *"
  [cron.job]
    type = "TestDatadog"
    retry = 1
[[cron]]
  schedule = "*/5 * * * *"
  [cron.job]
    type = "UpdateChatAgentCount"
    retry = 0

statsd.toml:
----
[statsd]
  # required, location of the statsd server
  location = "localhost:8125"

  # Prepend all metric names with this value, defaults to 'faktory.'
  # If you have multiple Faktory servers for multiple apps reporting to
  # the same statsd server you can use a multi-level namespace,
  # e.g. "app1.faktory.", "app2.faktory." or use a tag below.
  namespace = "faktory."

  # optional, DataDog-style tags to send with each metric.
  # keep in mind that every tag is sent with every metric so keep tags short.
  tags = ["env:production"]

  # Statsd client will buffer metrics for 100ms or until this size is reached.
  # The default value of 15 tries to avoid UDP packet sizes larger than 1500 bytes.
  # If your network supports jumbo UDP packets, you can increase this to ~50.
  #bufferSize = 15

Events:  <none>

@ecdemis123

I realized that I also could have posted our worker deployment yml. This is just a basic implementation, and I've removed company-specific info.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: faktory-worker
  labels:
    run: faktory-worker
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 50%
  template:
    metadata:
      labels:
        run: faktory-worker
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: task-pool
      volumes:
        - name: cloudsql
          emptyDir: null
      containers:
        - name: cloudsql-proxy
          image: 'gcr.io/cloudsql-docker/gce-proxy:1.10'
          command:
            - '/cloud_sql_proxy'
            - '--dir=/cloudsql'
            - '-instances=instance-id'
          volumeMounts:
            - name: cloudsql
              mountPath: /cloudsql
          lifecycle:
            preStop:
              exec:
                command: ['/bin/sh', '-c', '/bin/sleep 30']
        - name: faktory-worker
          imagePullPolicy: Always
          image: 'our repo docker image'
          command: ['node', 'dist/src/jobs/faktory/index.js']
          volumeMounts:
            - name: cloudsql
              mountPath: /cloudsql
          env:
            - name: FAKTORY_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: faktory
                  key: password
            - name: FAKTORY_USERNAME
              valueFrom:
                secretKeyRef:
                  name: faktory
                  key: username
            - name: DB_SOCKETPATH
              value: /cloudsql/socket-path
            - name: FAKTORY_URL
              value: "tcp://:$(FAKTORY_PASSWORD)@faktory:7419"
            - name: DEBUG
              value: faktory*
            - name: DD_SERVICE_NAME
              value: faktory-worker
            - name: DD_TRACE_AGENT_HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
          envFrom:
            - secretRef:
                name: #### company secrets
            - configMapRef:
                name: ### company config

@jbielick
Contributor

jbielick commented Dec 10, 2019

🎉 Chart now available on the helm hub 🎉

adwerx/faktory

helm repo add adwerx https://adwerx.github.io/charts
helm install --name faktory adwerx/faktory

Datadog Agent HostIP support coming soon.

@scottrobertson

Would it be useful if I PR'ed some example Kubernetes configs, or added them to the wiki?

@mperham
Collaborator Author

mperham commented Jan 17, 2020 via email

@dm3ch

dm3ch commented Jan 18, 2020

> @dm3ch I saw that the incubator PR was closed. Did you end up adding your chart / repo to the helm hub? I couldn't find it. Would you be opposed to me using some of the files from your PR to make a chart and publish?
>
> @ecdemis123 this is great. I assume this is a production setup? Glad you figured out the Datadog DaemonSet connection. I'll definitely be referencing that at some point to get it working correctly in one of our clusters. That initContainer is a clever workaround for interpolating the ENV var in the config.
>
> I think I'll start on a helm chart, which could automate sending the SIGHUP (is it not a USR1 signal?) upon changing of a cron config. I'll post here if I have some progress to show. Is baking in some datadog support a good idea? I know Sidekiq does this and I could borrow @ecdemis123's solution there.

Sorry for the late reply. Unfortunately, I haven't yet had time to publish my chart in a separate repo.
Thank you for publishing a chart on the Helm hub. If your chart contains pieces of my work, that's completely ok.

10 participants