If an application keeps repeatedly modifying parts of a large file, it eventually results in EIO #4308

Closed
zp001paul opened this issue Jan 4, 2024 · 2 comments · Fixed by #4309
Labels: kind/bug (Something isn't working), missed (missed bug)

@zp001paul

Found a problem:
If an application keeps repeatedly modifying parts of a large file, it eventually results in EIO.

How to reproduce:
juicefs mount -d "mysql://juiceschema:Juicestest++--@(juice2db-m.db.sfcloud.local:3306)/juiceschema" /juicefs_test/
fio -direct=1 -ioengine=sync -runtime=3600 -group_reporting -bs=8k -filesize=1g -nrfiles=1 -thread -iodepth 16 -numjobs=1 -readwrite=randwrite -name=randwrite -directory=/juicefs_test
That is, a single thread repeatedly rewrites parts of one file. In our environment the IO error (EIO) shows up after roughly 10 minutes of running.
The exact errors:

2023/02/10 11:32:51.014410 juicefs[82643] <WARNING>: write inode:273634 error: input/output error [writer.go:208]
2023/02/10 11:32:51.014443 juicefs[82643] <ERROR>: write inode:273634 indx:0  input/output error [writer.go:212]
2023/02/10 11:32:51.058436 juicefs[82643] <ERROR>: error: Error 1406: Data too long for column 'slices' at row 1  // this error is reported by the stack below; the failing inode is 273634, the same as in the lines above
goroutine 181792171 [running]:
runtime/debug.Stack()
        /usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/juicedata/juicefs/pkg/meta.errno({0x31e12c0, 0xc04c3287e0})
        /MyEBook/juicefs/pkg/meta/utils.go:104 +0xc5
github.com/juicedata/juicefs/pkg/meta.(*dbMeta).Write(0xc000acc0a0, {0xc09bfb1f78?, 0x3?}, 0x42ce2, 0x0, 0xb000, {0x1706873, 0x1000, 0x0, 0x1000})
        /MyEBook/juicefs/pkg/meta/sql.go:2037 +0x365
github.com/juicedata/juicefs/pkg/vfs.(*chunkWriter).commitThread(0xc0866521b0)
        /MyEBook/juicefs/pkg/vfs/writer.go:201 +0x1c4
created by github.com/juicedata/juicefs/pkg/vfs.(*fileWriter).writeChunk
        /MyEBook/juicefs/pkg/vfs/writer.go:270 +0x3d9 [utils.go:104]
2023/02/10 11:32:51.058499 juicefs[82643] <WARNING>: write inode:273634 error: input/output error [writer.go:208]
2023/02/10 11:32:51.058520 juicefs[82643] <ERROR>: write inode:273634 indx:0  input/output error [writer.go:212]
2023/02/10 11:32:51.106496 juicefs[82643] <ERROR>: error: Error 1406: Data too long for column 'slices' at row 1
goroutine 181792171 [running]:
runtime/debug.Stack()
        /usr/local/go/src/runtime/debug/stack.go:24 +0x65
github.com/juicedata/juicefs/pkg/meta.errno({0x31e12c0, 0xc0646189f0})
        /MyEBook/juicefs/pkg/meta/utils.go:104 +0xc5
github.com/juicedata/juicefs/pkg/meta.(*dbMeta).Write(0xc000acc0a0, {0xc09bfb1f78?, 0x3?}, 0x42ce2, 0x0, 0xfe000, {0x1706879, 0x1000, 0x0, 0x1000})
        /MyEBook/juicefs/pkg/meta/sql.go:2037 +0x365
github.com/juicedata/juicefs/pkg/vfs.(*chunkWriter).commitThread(0xc0866521b0)
        /MyEBook/juicefs/pkg/vfs/writer.go:201 +0x1c4
created by github.com/juicedata/juicefs/pkg/vfs.(*fileWriter).writeChunk
        /MyEBook/juicefs/pkg/vfs/writer.go:270 +0x3d9 [utils.go:104]
2023/02/10 11:32:51.106556 juicefs[82643] <WARNING>: write inode:273634 error: input/output error [writer.go:208]
2023/02/10 11:32:51.106577 juicefs[82643] <ERROR>: write inode:273634 indx:0  input/output error [writer.go:212]
2023/02/10 11:32:51.150219 juicefs[82643] <ERROR>: error: Error 1406: Data too long for column 'slices' at row 1
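
The "Data too long for column 'slices'" error suggests the serialized slice list of a single chunk outgrew its metadata column. A back-of-the-envelope check (a hedged sketch: the 24-byte slice record and the 64 KiB BLOB limit are assumptions based on reading the JuiceFS SQL schema, not stated in this issue):

package main

import "fmt"

func main() {
	const sliceRecordBytes = 24  // assumed: pos(4) + id(8) + size(4) + off(4) + len(4)
	const blobLimitBytes = 65535 // assumed: MySQL BLOB limit for the `slices` column
	// Every uncompacted write appends one slice record, so a single hot
	// chunk overflows after roughly this many writes:
	fmt.Println(blobLimitBytes / sliceRecordBytes) // ~2730
}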

Root cause analysis:

  1. The root cause is not that chunk compaction falls behind the application's write speed.
  2. The root cause is: when multiple chunks need compaction, some chunks never get compacted at all, so length(slices) eventually overflows.
    The problematic code:
func (m *dbMeta) compactChunk(inode Ino, indx uint32, force bool) {
	if !force {
		// avoid too many or duplicated compaction
		m.Lock()
		k := uint64(inode) + (uint64(indx) << 32)
		if len(m.compacting) > 10 || m.compacting[k] { // <----- the real problem: once more than 10 compact goroutines are running, further (ino, chunk_idx) pairs are silently dropped and never compacted
			m.Unlock()
			return
		}
		// ...
	}
	// ...
}
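
To see the starvation concretely, here is a minimal standalone sketch (hypothetical, simplified key handling) of the guard's behavior: once the map holds more than 10 entries, every further (inode, index) request is silently dropped rather than queued:

package main

import "fmt"

// compacting mirrors the guard in dbMeta.compactChunk (simplified).
var compacting = map[uint64]bool{}

func tryCompact(inode, indx uint64) bool {
	k := inode + (indx << 32)
	if len(compacting) > 10 || compacting[k] {
		return false // silently dropped: nothing remembers this key
	}
	compacting[k] = true
	return true
}

func main() {
	dropped := 0
	for indx := uint64(0); indx < 64; indx++ { // 64 dirty chunks of one inode
		if !tryCompact(1, indx) {
			dropped++
		}
	}
	// 11 slots get taken and, while they stay busy, the remaining 53
	// chunks keep accumulating slices until the column overflows.
	fmt.Println("dropped:", dropped) // dropped: 53
}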

Suggested improvements:

  1. When len(m.compacting) > 10, don't drop the (ino, c_indx) pair; store it in a slice instead.
  2. When a compaction completes (i.e. len(ss) < 2), take one (ino, c_indx) from that slice and keep going.
  3. Relax the completion criterion: change len(ss) < 2 to len(ss) < 100.

Preliminary patch:

func (m *dbMeta) compactChunk(inode Ino, indx uint32, force bool) {
	cptKey := compactKey{inode, indx}
	if !force {
		m.Lock()
		running, cptKeyExisted := m.compactMap[cptKey]
		// avoid duplicated compaction
		if cptKeyExisted && running {
			m.Unlock()
			logger.Warnf("Quit compact for [%d, %d], already running", inode, indx)
			return
		}
		// avoid too many compaction
		if len(m.compactMap) > 3 {
			// cptKeyExisted  : we need to run it till the end
			// !cptKeyExisted : leave it for another goroutine to start later
			if !cptKeyExisted {
				m.addToCompactPending(cptKey) // roughly: m.compactPending = append(m.compactPending, cptKey)
				logger.Warnf("Add to pending queue for [%d, %d], len1: %d, len2: %d",
					inode, indx, len(m.compactMap), len(m.compactPending))
				m.Unlock()
				return
			}
		}

		m.compactMap[cptKey] = true
		m.Unlock()

		defer func() {
			m.Lock()
			// compaction not finished yet; mark it as not running so it can resume later
			if _, ok := m.compactMap[cptKey]; ok {
				m.compactMap[cptKey] = false
			}
			m.Unlock()
		}()
	}

	logger.Warnf("Run compact for [%d, %d],compactMap len: %d", inode, indx, len(m.compactMap))

	var c = chunk{Inode: inode, Indx: indx}
	err := m.roTxn(func(s *xorm.Session) error {
		_, err := s.MustCols("indx").Get(&c)
		return err
	})
	if err != nil {
		logger.Warnf("return err1: %s", err.Error())
		return
	}

	ss := readSliceBuf(c.Slices)
	if ss == nil {
		logger.Errorf("Corrupt value for inode %d chunk indx %d", inode, indx)
		return
	}
	skipped := skipSome(ss)
	ss = ss[skipped:]
	pos, size, slices := compactChunk(ss)

	// compaction complete
	if len(ss) < 100 || size == 0 {
		m.Lock()
		delete(m.compactMap, cptKey)
		go m.runNextCompaction()
		m.Unlock()
		logger.Warnf("Deleted and End compacting for [%d, %d], compactMap: %v", inode, indx, m.compactMap)
		return
	}
...
}
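
The patch above calls two helpers that the snippet does not show. A minimal sketch of what they might look like (hypothetical bodies and field names, inferred only from the call sites above, not taken from the actual fix):

// addToCompactPending queues a (inode, indx) pair that could not start
// because too many compactions were already running. Caller holds m.Lock().
func (m *dbMeta) addToCompactPending(k compactKey) {
	m.compactPending = append(m.compactPending, k)
}

// runNextCompaction pops one pending pair and starts compacting it.
func (m *dbMeta) runNextCompaction() {
	m.Lock()
	if len(m.compactPending) == 0 {
		m.Unlock()
		return
	}
	k := m.compactPending[0]
	m.compactPending = m.compactPending[1:]
	m.Unlock()
	m.compactChunk(k.inode, k.indx, false)
}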

Environment:

  • JuiceFS version (use juicefs --version) or Hadoop Java SDK version: juicefs version 1.0.4+2023-04-06.f1c475d9
  • Cloud provider or hardware configuration running JuiceFS:
    -- oss: swift
    -- mysql 8.0
    -- juicefs client: bare metal x86, oracle-linux 7.9
  • OS (e.g cat /etc/os-release):
    NAME="Oracle Linux Server"
    VERSION="7.9"
    ID="ol"
    ID_LIKE="fedora"
    VARIANT="Server"
    VARIANT_ID="server"
    VERSION_ID="7.9"
    PRETTY_NAME="Oracle Linux Server 7.9"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:oracle:linux:7:9:server"
    HOME_URL="https://linux.oracle.com/"
    BUG_REPORT_URL="https://bugzilla.oracle.com/"

ORACLE_BUGZILLA_PRODUCT="Oracle Linux 7"
ORACLE_BUGZILLA_PRODUCT_VERSION=7.9
ORACLE_SUPPORT_PRODUCT="Oracle Linux"
ORACLE_SUPPORT_PRODUCT_VERSION=7.9

  • Kernel (e.g. uname -a): Linux zhangpudesktop 3.10.0-1160.el7.x86_64 #1 SMP Thu Oct 1 17:21:35 PDT 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Object storage (cloud provider and region, or self maintained): swift
  • Metadata engine info (version, cloud provider managed or self maintained): mysql 8.0
  • Network connectivity (JuiceFS to metadata engine, JuiceFS to object storage): Gigabit, LAN
  • Others:
@SandyXSD
Contributor

SandyXSD commented Jan 4, 2024

A simpler approach is to trigger compaction more aggressively when there are many slices, so a chunk is much less likely to be starved.
@zp001paul could you try this PR? #4309
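
A hedged sketch of what "more aggressively" could mean (a hypothetical illustration, not the actual diff of #4309): bypass the concurrency cap once a chunk's slice count gets close to overflowing the column.

// maybeCompact is a hypothetical wrapper: for mildly fragmented chunks the
// cap may drop the request, but a chunk with very many slices is compacted
// with force=true, which skips the len(m.compacting) > 10 guard entirely.
const forceCompactThreshold = 1000 // assumed value

func (m *dbMeta) maybeCompact(inode Ino, indx uint32, nslices int) {
	m.compactChunk(inode, indx, nslices > forceCompactThreshold)
}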

@zp001paul
Author

Retested today and the problem is solved. That was a fast fix!

@zhoucheng361 added the missed (missed bug) label on Jan 6, 2025