Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No matching certificate to load: decoding certificate metadata: invalid character '}' #6481

Closed
solracsf opened this issue Jul 27, 2024 · 30 comments
Labels
bug 🐞 Something isn't working
Milestone

Comments

@solracsf
Copy link
Contributor

Caddy v2.8.4 started printing an error for one domain:

Jul 27 12:06:59 caddy-proxy-sbg7 caddy[1208]: {"level":"info","ts":1722074819.7958932,"logger":"tls.on_demand","msg":"obtaining new certificate","remote_ip":"77.141.97.142","remote_port":"49984","server_name":"server.example.com"}
Jul 27 12:06:59 caddy-proxy-sbg7 caddy[1208]: {"level":"error","ts":1722074819.8085268,"logger":"tls.on_demand","msg":"loading newly-obtained certificate from storage","remote_ip":"77.141.97.142","remote_port":"49984","server_name":"server.example.com","error":"no matching certificate to load for server.example.com: decoding certificate metadata: invalid character '}' after top-level value"}
{
        "sans": [
                "server.example.com"
        ],
        "issuer_data": {
                "url": "https://acme-v02.api.letsencrypt.org/acme/cert/042fb47e52c51ab43af43257f1ba5e8",
                "ca": "https://acme-v02.api.letsencrypt.org/directory",
                "renewal_info": {
                        "suggestedWindow": {
                                "start": "2024-09-23T06:58:45.333333334Z",
                                "end": "2024-09-25T06:58:45.333333334Z"
                        },
                        "_uniqueIdentifier": "kydGmAOpUWiOmNbEI.BC-0flLFGrQ69DL_idV_G6Xo",
                        "_retryAfter": "2024-07-27T09:40:01.93861755+02:00",
                        "_selectedTime": "2024-09-24T03:22:43Z"
                }
        }
}}

Removing the extra } at the bottom of the file /caddy/certificates/acme-v02.api.letsencrypt.org-directory/server.example.com/server.example.com.json fixes the error.

@mholt
Copy link
Member

mholt commented Jul 27, 2024

Hey Skifree, thanks for reporting this. 😉

That's weird! @elee1766 Is this possibly due to a race condition you were referring to in Slack? I don't think I've seen a malformed JSON file before, only an empty one.

@mholt mholt added the bug 🐞 Something isn't working label Jul 27, 2024
@elee1766
Copy link
Contributor

elee1766 commented Jul 27, 2024

this looks like one file is written on top of another without clearing the first once.

if read is called after fsync is called after truncate and after write but before flush/close, this could happen I think? the new cert would have to be one character smaller than the old cert. (is this possible? are any fields variable length?).

but it's reading the old file size, so maybe something else happened at write time. regardless, something happened during file write, maybe also during a read.

but yes, my guess is the temp file approach fixes this @mholt

I doubt go's json encoder would do this.

@mholt
Copy link
Member

mholt commented Jul 27, 2024

@elee1766 Thanks for your input! Yeah, I think the _uniqueIdentifier may possibly vary by 1-2 chars. But I'm not 100% sure on that. It's related to the ASN.1 encoding of the cert's serial number.

But anyway, it sounds like we have a fix in the pipeline for this already. :)

@mholt
Copy link
Member

mholt commented Jul 27, 2024

@solracsf I don't suppose you happen to have the previous JSON file for this cert?

@elee1766
Copy link
Contributor

elee1766 commented Jul 27, 2024

also, did anything weird happen to the server ? like a reboot or process restart or config reload, when it started happening? also, are you running only one instance of caddy?

there's also a third variable - maybe two certs could have been being written at once somehow. iirc the lock in filestorage is not perfect. @mholt isn't this possible? (for the imperfect file based lock to cause parallel write?)

@solracsf
Copy link
Contributor Author

solracsf commented Jul 28, 2024

I have 2 caddy instances, 2 distinct servers. Caddy storage is a JuiceFS mount.
JuiceFS supports both BSD locks (flock) and POSIX record locks (fcntl).

Problem started when servers rebooted (manually, one after another, to apply kernel updates). This has been done before many times without any issues.

# GLOBAL options
{
        email [email protected]

        storage file_system {
                root /mnt/caddyvol/caddy
        }

        servers {
                strict_sni_host on
                protocols h1 h2
                trusted_proxies_strict
        }

        on_demand_tls {
                ask http://ask.localhost/check
        }
}

@elee1766
Copy link
Contributor

elee1766 commented Jul 28, 2024

@mholt yes, at this point i'm convinced the tempfile patch would make this not happen. it looks like two caddy instances were writing to the same file at the same time.

@solracsf here is the relevant PR, if you are curious caddyserver/certmagic#300

from what i see - juicefs rename is atomic? so this PR should work for you.

btw, certmagic storage doesn't use advisory locks yet. i opened an issue about this caddyserver/certmagic#295 . I personally dont have a use case for this, since i do not use the filesystem to share certs across multiple instances, so i didn't end up finishing this.

if you have the time it would be nice to add this as a feature. matt said he is happy to accept a PR to correctly use filesystem level locks instead of the fake-lock that filesystem storage currently uses. I would be happy to review as well. basically just need to check on boot if the filesystem supports a locking method, and use it, otherwise fallback to the existing strategy

that said, i would seriously recommend just using redis with https://github.com/pberkel/caddy-storage-redis or https://github.com/ss098/certmagic-s3 with juicefs s3 gateway, or writing your own cert-magic plugin for s3 for any production settings.

@solracsf
Copy link
Contributor Author

solracsf commented Aug 2, 2024

that said, i would seriously recommend just using redis with https://github.com/pberkel/caddy-storage-redis or https://github.com/ss098/certmagic-s3 with juicefs s3 gateway, or writing your own cert-magic plugin for s3 for any production settings.

I understand, but we prefer not adding plugins to caddy. In our env, JuiceFS handles terabytes of data without any issue since 2y now, we prefere keep it as our filesystem abstraction layer and rely on Caddy certmagic filesystem.

We'll keep an aye on caddyserver/certmagic#300 and provide feedback. 👍

@mholt
Copy link
Member

mholt commented Aug 5, 2024

@solracsf You can try it now, if you build Caddy with xcaddy build --with github.com/caddyserver/certmagic=github.com/caddyserver/certmagic@master.

@solracsf
Copy link
Contributor Author

solracsf commented Aug 6, 2024

Will be hard to say if it fixes anything soon, because caddy was installed and runing for 2y with this setup without any issues 🤔
Nevertheless, I've built our 2 servers with latest certmagic@master and report back if this happens again.
Feel free to close this issue for now.

@mholt mholt closed this as completed Aug 23, 2024
@lqs
Copy link

lqs commented Aug 29, 2024

If you are running multiple instances of Caddy with shared storage, you may encounter a locking issue introduced in certmagic v0.21.

See caddyserver/certmagic#303

@mholt
Copy link
Member

mholt commented Aug 29, 2024

Yeah, sorry; I'm behind on releases. Trying to catch up on things!

@aa-shalin
Copy link

aa-shalin commented Oct 15, 2024

hey @mholt, i'm facing the exact same issue. I recently saw some JSON files containing an additional } at the end of file which led to this error being thrown by Caddy

{"error":"no matching certificate to load for insights.example.se: decoding certificate metadata: invalid character '}' after top-level value", "level":"debug", "logger":"tls.handshake", "msg":"did not load cert from storage", "remote_ip":"10.142.0.147", "remote_port":"11452", "server_name":"insights.example.se", "ts":1.7288959881120608E9}

Do we have any fix for this issue? I'm already using Caddy v2.8.4 and running multiple instances of Caddy with a shared NFS volume.

Would I be able to solve this issue by moving to Google Cloud Storage/S3?

@francislavoie
Copy link
Member

francislavoie commented Oct 15, 2024

The workaround is to wipe out Caddy's stroage and restart Caddy. It'll reissue all your certificates, but that should be fine assuming you have only a handful. Or, you can dive into Caddy's storage and look at all the .json files, fix their syntax manually if possible (or delete the broken ones to get Caddy to regenerate them).

In v2.9.0 it should be fixed by caddyserver/certmagic#300

@aa-shalin
Copy link

The workaround is to wipe out Caddy's stroage and restart Caddy. It'll reissue all your certificates, but that should be fine assuming you have only a handful.

In v2.9.0 it should be fixed by caddyserver/certmagic#300

Thank you @francislavoie! I'll upgrade to v2.9.0 and check all the JSON files as i have > 1k certs.

@aa-shalin
Copy link

@francislavoie I see that v2.9.0 is still in beta, could I build v2.8.4 with certmagic@mater instead? i'm already building Caddy binary using xcaddy

@francislavoie
Copy link
Member

Upgrading won't fix the broken storage, it'll just fix further breakages of the storage. You could use the beta, it's pretty stable, just missing a few things we wanted to fit into the final release.

@ssc-ksaitou
Copy link

ssc-ksaitou commented Nov 8, 2024

I have faced this issue too in this morning after upgrading Caddy to v2.8.4 and OS got rebooted.
That Caddy has multiple sites but one site had showed this log messages continuously after upgrading.

I solved this by removing the site's entire certificate cache dir (such as mv /var/lib/caddy/.local/share/caddy/certificates/acme-v02.api.letsencrypt.org-directory/mysite.example.com{,_bak} in Ubuntu) and restarted Caddy.
But I think it's OK just to fix broken JSON.

@bryanus
Copy link

bryanus commented Nov 13, 2024

Ran into this today also. Deleted the cert from storage and resrted Caddy, and it issued a new cert successfully. Waiting patiently for 2.9.0!

@francislavoie
Copy link
Member

@bryanus you can use 2.9.0-beta.3 right now

@bryanus
Copy link

bryanus commented Nov 25, 2024

Just found another cert in my system that had this issue. Could someone remind me on how to perform an upgrade to the 2.9.0-beta.3? And do I need to wipe all my certs first to have them reissued? TIA.

@elee1766
Copy link
Contributor

elee1766 commented Nov 25, 2024

@bryanus

And do I need to wipe all my certs first to have them reissued? TIA.

you do not need to wipe all of your certs. I assume that on error, the certs will refetch? maybe @mholt can confirm though .

it may be good to go through the certs and delete any of them that are malformed json.

Could someone remind me on how to perform an upgrade to the 2.9.0-beta.3?

depends on how you are running, but you just need to switch to running the new binary / install the new package from https://github.com/caddyserver/caddy/releases/tag/v2.9.0-beta.3 and restart.

@aa-shalin
Copy link

aa-shalin commented Nov 26, 2024

I wrote this script to find all malformed JSON files

# Directory containing the subdirectories
base_dir="/data/caddy/certificates/acme-v02.api.letsencrypt.org-directory"

# Output file to record JSON files ending with "}}\n}" or "}}"
output_file="valid_json_files.txt"

# Check if output file already exists and remove it to start fresh
if [ -f "$output_file" ]; then
    rm "$output_file"
fi

# Navigate through each subdirectory
for dir in "$base_dir"/*; do
    if [ -d "$dir" ]; then
        # Loop through each JSON file in the directory
        for json_file in "$dir"/*.json; do
            if [ -f "$json_file" ]; then
                # Use awk to check if the file ends with "}}\n}"
                last_chars=$(tail -n 2 "$json_file" | awk '/}}/{if(getline == 1 && /^}$/) print "yes";}')
                if [[ "$last_chars" == "yes" ]]; then
                    # Record the file path in the output file
                    echo "$json_file" >> "$output_file"
                fi
                if tail -c 2 "$json_file" | grep -q "}}"; then
                    # Record the file path in the output file
                    echo "$json_file" >> "$output_file"
                fi
            fi
        done
    fi
done

echo "Script execution completed. Check '$output_file' for results."

@mholt
Copy link
Member

mholt commented Nov 26, 2024

I assume that on error, the certs will refetch?

Caddy will try again later if there's an error.

it may be good to go through the certs and delete any of them that are malformed json.

That will help speed things up, I'm pretty sure.

Thanks for helping @elee1766 !

And thanks for posting your script @aa-shalin

@rainerborene
Copy link
Contributor

rainerborene commented Dec 13, 2024

Same issue here. I'm running a Caddy instance on a single machine with certs stored on a ext4 partition.

Here is a one line version of @aa-shalin script that fixes this issue as well:

$ grep -l "}}" **/*.json | xargs -I {} sed -i 's/}}/}/g' {}

@mholt Do you have any estimates for the release of version 2.9.0?

@mholt
Copy link
Member

mholt commented Dec 13, 2024

Shortly after new years. One more beta before then I think.

@evolross
Copy link

evolross commented Dec 16, 2024

This issue likely killed a production server of one of our app's main services. We have about 150 domains/certs, most of them not used very often. But the } bug finally hit our default domain cert causing a DOS for customers. Though upon seeing this thread we found the bug in about four of the 150 certs.

We're running a single instance of Caddy 2.8.4 in an AWS EC2 instance with an 8GB AWS gp3 volume.

For logs, these two lines start and continue repeating:

Dec 13 18:46:05 ip-12-3-4-567.ec2.internal caddy[123456]: {"level":"info","ts":1734115565.8264415,"logger":"tls","msg":"certificate needs renewal based on ARI window","subjects":["our.domain.com"],"expiration":17342566>
Dec 13 18:46:05 ip-12-3-4-567.ec2.internal caddy[123456]: {"level":"warn","ts":1734115565.826408,"logger":"tls.on_demand","msg":"stapling OCSP","server_name":"our.domain.com","sans":["our.domain.com"],"error":"i>

Then the first log of the brace error occurs and repeats every few seconds:

Dec 13 18:46:06 ip-12-3-4-567.ec2.internal caddy[123456]: {"level":"error","ts":1734115566.8684826,"logger":"tls.renew","msg":"will retry","error":"decoding certificate metadata: invalid character '}' after top-level value",">

Does the ts value tie this to the cert? Otherwise, I'm not 100% positive this is for our default domain but it lines up timewise.

Overall, we've been using Caddy for two years with no problems. This particular instance has been running since July 2024 when we upgraded to 2.8.4 with no issues, reboots, or restarts.

Kind of a wild bug to still be floating around in a current release. Fixing the bad certs fixed the issue. Also anxiously awaiting a new version. But hopefully the fix doesn't assume multiple Caddy instances as the problem.

@mholt mholt added this to the v2.9.0 milestone Dec 16, 2024
@elee1766
Copy link
Contributor

elee1766 commented Dec 16, 2024

Kind of a wild bug to still be floating around in a current release. Fixing the bad certs fixed the issue. Also anxiously awaiting a new version. But hopefully the fix doesn't assume multiple Caddy instances as the problem.

@evolross

the fix doesn't assume multiple caddy instances as a problem, the current beta (and/or next version if you dont wish to use the beta) should resolve issues for both replicated and single-instance deploys. We have other users who have reported the same bug on single-instance deploys as well.

@mholt
Copy link
Member

mholt commented Dec 18, 2024

@evolross Those log messages are not related AFAICT (but it's hard since they seem truncated).

The error reported here is caused, as far as we know, by an incomplete write, often due to power loss, process termination, out of disk space, and other external factors. I have even seen failed writes where the OS did not return an error, so there's no way Caddy could know.

@elee1766 's patch is helpful because it writes the contents to a separate file, then renames only once the entire contents have been written, rather than opening the file in truncation mode (effectively deleting the contents) and hoping it writes successfully.

I'll be releasing one more beta (hopefully just one) before the new year, and then 2.9 after the new year.

@solracsf
Copy link
Contributor Author

Can confirm 2.9.0 fixed it 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🐞 Something isn't working
Projects
None yet
Development

No branches or pull requests

10 participants