Skip to content

Commit

Permalink
opt: add decoder Option NoValidateJSON for skipping JSON faster (#696)
Browse files Browse the repository at this point in the history
Co-authored-by: liuqiang.06 <[email protected]>
  • Loading branch information
AsterDY and liuq19 authored Sep 13, 2024
1 parent 9e5bbc5 commit edc70ff
Show file tree
Hide file tree
Showing 100 changed files with 114,595 additions and 101,326 deletions.
20 changes: 11 additions & 9 deletions .github/workflows/benchmark.yml
Original file line number Diff line number Diff line change
Expand Up @@ -27,11 +27,12 @@ jobs:
${{ runner.os }}-go-
- name: Benchmark Target
continue-on-error: true
run: |
export SONIC_NO_ASYNC_GC=1
export SONIC_BENCH_SINGLE=1
go test -run ^$ -count=10 -benchmem -bench 'Benchmark(Encoder|Decoder)_(Generic|Binding)_Sonic' ./decoder >> /var/tmp/sonic_bench_target.out
go test -run ^$ -count=10 -benchmem -bench 'Benchmark(Get|Set)One_Sonic|BenchmarkParseSeven_Sonic' ./ast >> /var/tmp/sonic_bench_target.out
go test -run ^$ -count=100 -benchmem -bench 'BenchmarkDecoder_(Generic|Binding)_Sonic' ./decoder >> /var/tmp/sonic_bench_target_${{ github.run_id }}.out
go test -run ^$ -count=100 -benchmem -bench 'BenchmarkEncoder_(Generic|Binding)_Sonic' ./encoder >> /var/tmp/sonic_bench_target_${{ github.run_id }}.out
go test -run ^$ -count=100 -benchmem -bench 'Benchmark(Get|Set)One_Sonic|BenchmarkParseSeven_Sonic' ./ast >> /var/tmp/sonic_bench_target_${{ github.run_id }}.out
- name: Clear repository
run: sudo rm -fr $GITHUB_WORKSPACE && mkdir $GITHUB_WORKSPACE
Expand All @@ -42,14 +43,15 @@ jobs:
ref: main

- name: Benchmark main
continue-on-error: true
run: |
export SONIC_NO_ASYNC_GC=1
export SONIC_BENCH_SINGLE=1
go test -run ^$ -count=10 -benchmem -bench 'Benchmark(Encoder|Decoder)_(Generic|Binding)_Sonic' ./decoder >> /var/tmp/sonic_bench_main.out
go test -run ^$ -count=10 -benchmem -bench 'Benchmark(Get|Set)One_Sonic|BenchmarkParseSeven_Sonic' ./ast >> /var/tmp/sonic_bench_main.out
UNIQUE_ID=${{ github.run_id }}
go test -run ^$ -count=100 -benchmem -bench 'BenchmarkDecoder_(Generic|Binding)_Sonic' ./decoder >> /var/tmp/sonic_bench_main_${{ github.run_id }}.out
go test -run ^$ -count=100 -benchmem -bench 'BenchmarkEncoder_(Generic|Binding)_Sonic' ./encoder >> /var/tmp/sonic_bench_main_${{ github.run_id }}.out
go test -run ^$ -count=100 -benchmem -bench 'Benchmark(Get|Set)One_Sonic|BenchmarkParseSeven_Sonic' ./ast >> /var/tmp/sonic_bench_main_${{ github.run_id }}.out
- name: Diff bench
continue-on-error: true
run: |
go get golang.org/x/perf/cmd/benchstat && go install golang.org/x/perf/cmd/benchstat
benchstat -format=csv /var/tmp/sonic_bench_target.out /var/tmp/sonic_bench_main.out
# run: ./scripts/bench.py -t 0.05 -d /var/tmp/sonic_bench_target.out,/var/tmp/sonic_bench_main.out x
./scripts/bench.py -t 0.20 -d /var/tmp/sonic_bench_target_${{ github.run_id }}.out,/var/tmp/sonic_bench_main_${{ github.run_id }}.out x
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -52,3 +52,4 @@ fuzz/testdata
*__debug_bin*
*pprof
*coverage.txt
tools/venv/*
28 changes: 18 additions & 10 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -283,19 +283,21 @@ sub := root.Get("key3").Index(2).Int64() // == 3
**Tip**: since `Index()` uses offset to locate data, which is much faster than scanning like `Get()`, we suggest you use it as much as possible. And sonic also provides another API `IndexOrGet()` to underlying use offset as well as ensure the key is matched.

#### SearchOption

`Searcher` provides some options for user to meet different needs:

```go
opts := ast.SearchOption{ CopyReturn: true ... }
val, err := sonic.GetWithOptions(JSON, opts, "key")
```

- CopyReturn
Indicate the searcher to copy the result JSON string instead of refer from the input. This can help to reduce memory usage if you cache the results
- ConcurentRead
Since `ast.Node` use `Lazy-Load` design, it doesn't support Concurrently-Read by default. If you want to read it concurrently, please specify it.
- ValidateJSON
Indicate the searcher to validate the entire JSON. This option is enabled by default, which slow down the search speed a little.


#### Set/Unset

Modify the json content by Set()/Unset()
Expand All @@ -314,6 +316,7 @@ println(alias1 == alias2) // true
exist, err := root.UnsetByIndex(1) // exist == true
println(root.Get("key4").Check()) // "value not exist"
```

#### Serialize

To encode `ast.Node` as json, use `MarshalJson()` or `json.Marshal()` (MUST pass the node's pointer)
Expand Down Expand Up @@ -381,16 +384,12 @@ See [ast/visitor.go](https://github.com/bytedance/sonic/blob/main/ast/visitor.go

## Compatibility

Sonic **DOES NOT** ensure to support all environments, due to the difficulty of developing high-performance codes. For developers who use sonic to build their applications in different environments, we have the following suggestions:

- Developing on **Mac M1**: Make sure you have Rosetta 2 installed on your machine, and set `GOARCH=amd64` when building your application. Rosetta 2 can automatically translate x86 binaries to arm64 binaries and run x86 applications on Mac M1.
- Developing on **Linux arm64**: You can install qemu and use the `qemu-x86_64 -cpu max` command to convert x86 binaries to amr64 binaries for applications built with sonic. The qemu can achieve a similar transfer effect to Rosetta 2 on Mac M1.
For developers who want to use sonic to meet diffirent scenarios, we provide some integrated configs as `sonic.API`

For developers who want to use sonic on Linux arm64 without qemu, or those who want to handle JSON strictly consistent with `encoding/json`, we provide some compatible APIs as `sonic.API`

- `ConfigDefault`: the sonic's default config (`EscapeHTML=false`,`SortKeys=false`...) to run on sonic-supporting environment. It will fall back to `encoding/json` with the corresponding config, and some options like `SortKeys=false` will be invalid.
- `ConfigStd`: the std-compatible config (`EscapeHTML=true`,`SortKeys=true`...) to run on sonic-supporting environment. It will fall back to `encoding/json`.
- `ConfigFastest`: the fastest config (`NoQuoteTextMarshaler=true`) to run on sonic-supporting environment. It will fall back to `encoding/json` with the corresponding config, and some options will be invalid.
- `ConfigDefault`: the sonic's default config (`EscapeHTML=false`,`SortKeys=false`...) to run sonic fast meanwhile ensure security.
- `ConfigStd`: the std-compatible config (`EscapeHTML=true`,`SortKeys=true`...)
- `ConfigFastest`: the fastest config (`NoQuoteTextMarshaler=true`) to run on sonic as fast as possible.
Sonic **DOES NOT** ensure to support all environments, due to the difficulty of developing high-performance codes. On non-sonic-supporting environment, the implementation will fall back to `encoding/json`. Thus beflow configs will all equal to `ConfigStd`.

## Tips

Expand Down Expand Up @@ -480,8 +479,17 @@ For better performance, in previous case the `ast.Visitor` will be the better ch
But `ast.Visitor` is not a very handy API. You might need to write a lot of code to implement your visitor and carefully maintain the tree hierarchy during decoding. Please read the comments in [ast/visitor.go](https://github.com/bytedance/sonic/blob/main/ast/visitor.go) carefully if you decide to use this API.

### Buffer Size

Sonic use memory pool in many places like `encoder.Encode`, `ast.Node.MarshalJSON` to improve performace, which may produce more memory usage (in-use) when server's load is high. See [issue 614](https://github.com/bytedance/sonic/issues/614). Therefore, we introduce some options to let user control the behavior of memory pool. See [option](https://pkg.go.dev/github.com/bytedance/[email protected]/option#pkg-variables) package.

### Faster JSON skip

For security, sonic use [FSM](native/skip_one.c) algorithm to validate JSON when decoding raw JSON or encoding `json.Marshaler`, which is much slower (1~10x) than [SIMD-searching-pair](native/skip_one_fast.c) algorithm. If user has many redundant JSON value and DO NOT NEED to strictly validate JSON correctness, you can enable below options:

- `Config.NoValidateSkipJSON`: for faster skipping JSON when decoding, such as unknown fields, json.Unmarshaler(json.RawMessage), mismatched values, and redundant array elements
- `Config.NoValidateJSONMarshaler`: avoid validating JSON when encoding `json.Marshaler`
- `SearchOption.ValidateJSON`: indicates if validate located JSON value when `Get`

## Community

Sonic is a subproject of [CloudWeGo](https://www.cloudwego.io/). We are committed to building a cloud native ecosystem.
26 changes: 17 additions & 9 deletions README_ZH_CN.md
Original file line number Diff line number Diff line change
Expand Up @@ -283,11 +283,14 @@ sub := root.Get("key3").Index(2).Int64() // == 3
**注意**:由于 `Index()` 使用偏移量来定位数据,比使用扫描的 `Get()` 要快的多,建议尽可能的使用 `Index` 。 Sonic 也提供了另一个 API, `IndexOrGet()` ,以偏移量为基础并且也确保键的匹配。

#### 查找选项

`ast.Searcher`提供了一些选项,以满足用户的不同需求:

```
opts:= ast.SearchOption{CopyReturn: true…}
Val, err:= sonic。gettwithoptions (JSON, opts, "key")
```

- CopyReturn
指示搜索器复制结果JSON字符串,而不是从输入引用。如果用户缓存结果,这有助于减少内存使用
- ConcurentRead
Expand Down Expand Up @@ -381,16 +384,12 @@ type Visitor interface {

## 兼容性

由于开发高性能代码的困难性, Sonic ****保证对所有环境的支持。对于在不同环境中使用 Sonic 构建应用程序的开发者,我们有以下建议:

-**Mac M1** 上开发:确保在您的计算机上安装了 Rosetta 2,并在构建时设置 `GOARCH=amd64` 。 Rosetta 2 可以自动将 x86 二进制文件转换为 arm64 二进制文件,并在 Mac M1 上运行 x86 应用程序。
-**Linux arm64** 上开发:您可以安装 qemu 并使用 `qemu-x86_64 -cpu max` 命令来将 x86 二进制文件转换为 arm64 二进制文件。qemu可以实现与Mac M1上的Rosetta 2类似的转换效果。

对于希望在不使用 qemu 下使用 sonic 的开发者,或者希望处理 JSON 时与 `encoding/JSON` 严格保持一致的开发者,我们在 `sonic.API` 中提供了一些兼容性 API
对于想要使用sonic来满足不同场景的开发人员,我们提供了一些集成配置:

- `ConfigDefault`: 在支持 sonic 的环境下 sonic 的默认配置(`EscapeHTML=false``SortKeys=false`等)。行为与具有相应配置的 `encoding/json` 一致,一些选项,如 `SortKeys=false` 将无效。
- `ConfigStd`: 在支持 sonic 的环境下与标准库兼容的配置(`EscapeHTML=true``SortKeys=true`等)。行为与 `encoding/json` 一致。
- `ConfigFastest`: 在支持 sonic 的环境下运行最快的配置(`NoQuoteTextMarshaler=true`)。行为与具有相应配置的 `encoding/json` 一致,某些选项将无效。
- `ConfigDefault`: sonic的默认配置 (`EscapeHTML=false``SortKeys=false`…) 保证性能同时兼顾安全性。
- `ConfigStd`: 与 `encoding/json` 保证完全兼容的配置
- `ConfigFastest`: 最快的配置(`NoQuoteTextMarshaler=true...`) 保证性能最优但是会缺少一些安全性检查(validate UTF8 等)
Sonic ****确保支持所有环境,由于开发高性能代码的困难。在不支持声音的环境中,实现将回落到 `encoding/json`。因此上述配置将全部等于`ConfigStd`

## 注意事项

Expand Down Expand Up @@ -478,8 +477,17 @@ go someFunc(user)
但是,`ast.Visitor` 并不是一个很易用的 API。你可能需要写大量的代码去实现自己的 `ast.Visitor`,并且需要在解析过程中仔细维护树的层级。如果你决定要使用这个 API,请先仔细阅读 [ast/visitor.go](https://github.com/bytedance/sonic/blob/main/ast/visitor.go) 中的注释。

### 缓冲区大小

Sonic在许多地方使用内存池,如`encoder.Encode`, `ast.Node.MarshalJSON`等来提高性能,这可能会在服务器负载高时产生更多的内存使用(in-use)。参见[issue 614](https://github.com/bytedance/sonic/issues/614)。因此,我们引入了一些选项来让用户配置内存池的行为。参见[option](https://pkg.go.dev/github.com/bytedance/[email protected]/option#pkg-variables)包。

### 更快的 JSON Skip

为了安全起见,在跳过原始JSON 时,sonic decoder 默认使用[FSM](native/skip_one.c)算法扫描来跳过同时校验 JSON。它相比[SIMD-searching-pair](native/skip_one_fast.c)算法跳过要慢得多(1~10倍)。如果用户有很多冗余的JSON值,并且不需要严格验证JSON的正确性,你可以启用以下选项:

- `Config.NoValidateSkipJSON`: 用于在解码时更快地跳过JSON,例如未知字段,`json.RawMessage`,不匹配的值和冗余的数组元素等
- `Config.NoValidateJSONMarshaler`: 编码JSON时避免验证JSON。封送拆收器
- `SearchOption.ValidateJSON`: 指示当`Get`时是否验证定位的JSON值

## 社区

Sonic 是 [CloudWeGo](https://www.cloudwego.io/) 下的一个子项目。我们致力于构建云原生生态系统。
5 changes: 5 additions & 0 deletions api.go
Original file line number Diff line number Diff line change
Expand Up @@ -84,6 +84,10 @@ type Config struct {
// NoValidateJSONMarshaler indicates that the encoder should not validate the output string
// after encoding the JSONMarshaler to JSON.
NoValidateJSONMarshaler bool

// NoValidateJSONSkip indicates the decoder should not validate the JSON value when skipping it,
// such as unknown-fields, mismatched-type, redundant elements..
NoValidateJSONSkip bool

// NoEncoderNewline indicates that the encoder should not add a newline after every message
NoEncoderNewline bool
Expand All @@ -109,6 +113,7 @@ var (
ConfigFastest = Config{
NoQuoteTextMarshaler: true,
NoValidateJSONMarshaler: true,
NoValidateJSONSkip: true,
}.Froze()
)

Expand Down
2 changes: 2 additions & 0 deletions decoder/decoder_compat.go
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,7 @@ const (
_F_use_number = types.B_USE_NUMBER
_F_validate_string = types.B_VALIDATE_STRING
_F_allow_control = types.B_ALLOW_CONTROL
_F_no_validate_json = types.B_NO_VALIDATE_JSON
)

type Options uint64
Expand All @@ -53,6 +54,7 @@ const (
OptionDisableUnknown Options = 1 << _F_disable_unknown
OptionCopyString Options = 1 << _F_copy_string
OptionValidateString Options = 1 << _F_validate_string
OptionNoValidateJSON Options = 1 << _F_no_validate_json
)

func (self *Decoder) SetOptions(opts Options) {
Expand Down
1 change: 1 addition & 0 deletions decoder/decoder_native.go
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,7 @@ const (
OptionDisableUnknown Options = api.OptionDisableUnknown
OptionCopyString Options = api.OptionCopyString
OptionValidateString Options = api.OptionValidateString
OptionNoValidateJSON Options = api.OptionNoValidateJSON
)

// StreamDecoder is the decoder context object for streaming input.
Expand Down
88 changes: 81 additions & 7 deletions decoder/decoder_native_test.go
Original file line number Diff line number Diff line change
@@ -1,7 +1,6 @@
//go:build (amd64 && go1.17 && !go1.24) || (arm64 && go1.20 && !go1.24)
// +build amd64,go1.17,!go1.24 arm64,go1.20,!go1.24


/*
* Copyright 2021 ByteDance Inc.
*
Expand All @@ -21,15 +20,90 @@
package decoder

import (
`encoding/json`
_`strings`
`testing`
_`reflect`
"encoding/json"
"fmt"
_ "reflect"
"strings"
_ "strings"
"testing"
"time"

`github.com/bytedance/sonic/internal/rt`
`github.com/stretchr/testify/assert`
"github.com/bytedance/sonic/internal/rt"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)


func BenchmarkSkipValidate(b *testing.B) {
type skiptype struct {
A int `json:"a"` // mismatched
B string `json:"-"` // ommited
C [1]int `json:"c"` // fast int
D struct {} `json:"d"` // empty struct
E map[string]int `json:"e"` // mismatched elem
F json.RawMessage `json:"f"` // unmarshaler
// Unknonwn
}
type C struct {
name string
json string
expTime float64
}
var sam = map[int]interface{}{}
for i := 0; i < 1; i++ {
sam[i] = _BindingValue
}
comptd, err := json.Marshal(sam)
if err != nil {
b.Fatal("invalid json")
}
compt := string(comptd)
var cases = []C{
{"mismatched", `{"a":`+compt+`}`, 5},
{"ommited", `{"b":`+compt+`}`, 5},
{"number", `{"c":[`+strings.Repeat("-1.23456e-19,", 1000)+`1]}`, 1.5},
{"unknown", `{"unknown":`+compt+`}`, 5},
{"empty", `{"d":`+compt+`}`, 5},
{"mismatched elem", `{"e":`+compt+`}`, 5},
{"unmarshaler", `{"f":`+compt+`}`, 3},
}
_ = NewDecoder(`{}`).Decode(&skiptype{})

var avg1, avg2 time.Duration
for _, c := range cases {
b.Run(c.name, func(b *testing.B) {
b.Run("validate", func(b *testing.B) {
b.ResetTimer()
t1 := time.Now()
for i := 0; i < b.N; i++ {
var obj1 = &skiptype{}
// validate skip
d := NewDecoder(c.json)
_ = d.Decode(obj1)
}
d1 := time.Since(t1)
avg1 = d1/time.Duration(b.N)
})
b.Run("fast", func(b *testing.B) {
b.ResetTimer()
t2 := time.Now()
for i := 0; i < b.N; i++ {
var obj2 = &skiptype{}
// fask skip
d := NewDecoder(c.json)
d.SetOptions(OptionNoValidateJSON)
_ = d.Decode(obj2)
}
d2 := time.Since(t2)
avg2 = d2/time.Duration(b.N)
})
// fast skip must be expTime x faster
require.True(b, float64(avg1)/float64(avg2) > c.expTime, fmt.Sprintf("%v/%v=%v", avg1, avg2, float64(avg1)/float64(avg2)))
})
}
}


func TestSkipMismatchTypeAmd64Error(t *testing.T) {
// t.Run("struct", func(t *testing.T) {
// println("TestSkipError")
Expand Down
21 changes: 10 additions & 11 deletions decoder/decoder_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -17,16 +17,16 @@
package decoder

import (
`encoding/json`
`runtime`
`runtime/debug`
`strings`
`sync`
`testing`
`time`

`github.com/stretchr/testify/assert`
`github.com/stretchr/testify/require`
"encoding/json"
"runtime"
"runtime/debug"
"strings"
"sync"
"testing"
"time"

"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)

func TestMain(m *testing.M) {
Expand Down Expand Up @@ -85,7 +85,6 @@ func init() {
_ = json.Unmarshal([]byte(TwitterJson), &_BindingValue)
}


func TestSkipMismatchTypeError(t *testing.T) {
t.Run("struct", func(t *testing.T) {
println("TestSkipError")
Expand Down
1 change: 1 addition & 0 deletions internal/decoder/api/decoder.go
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,7 @@ const (
OptionDisableUnknown = consts.OptionDisableUnknown
OptionCopyString = consts.OptionCopyString
OptionValidateString = consts.OptionValidateString
OptionNoValidateJSON = consts.OptionNoValidateJSON
)

type (
Expand Down
3 changes: 3 additions & 0 deletions internal/decoder/consts/option.go
Original file line number Diff line number Diff line change
Expand Up @@ -12,9 +12,11 @@ const (
F_disable_unknown = 3
F_copy_string = 4


F_use_number = types.B_USE_NUMBER
F_validate_string = types.B_VALIDATE_STRING
F_allow_control = types.B_ALLOW_CONTROL
F_no_validate_json = types.B_NO_VALIDATE_JSON
)

type Options uint64
Expand All @@ -26,6 +28,7 @@ const (
OptionDisableUnknown Options = 1 << F_disable_unknown
OptionCopyString Options = 1 << F_copy_string
OptionValidateString Options = 1 << F_validate_string
OptionNoValidateJSON Options = 1 << F_no_validate_json
)

const (
Expand Down
Loading

0 comments on commit edc70ff

Please sign in to comment.