Migration of repositories with tags to organisation fails #8540

Closed
oaxiento opened this issue Oct 16, 2019 · 29 comments · Fixed by #8602
Labels: issue/confirmed, type/bug

Comments

@oaxiento commented Oct 16, 2019

  • Gitea version (or commit ref): 1.9.4
  • Git version: 1.8.3.1
  • Operating system: CentOS 7
  • Database (use [x]):
    • [ ] PostgreSQL
    • [x] MariaDB
    • [ ] MSSQL
    • [ ] SQLite
  • Can you reproduce the bug at https://try.gitea.io:
    • Yes (provide example URL)
    • No
    • Not relevant
  • Log gist:

Description

We are currently migrating all of our 230 repositories to Gitea. Here's how we do it:

  1. git clone --mirror https://old-git.example.org/repo.git
  2. git push --mirror https://gitea.example.org/ORG/repo.git

To keep it simple, our Gitea instance has just one organisation with around 110 "Owner" team members. When we migrate repositories with more than 50 tags to that organisation, the git client pushes everything without an error, but Gitea fails while creating a release for every tag, with the following message:

2019/10/16 11:12:42 models/webhook.go:909:DeliverHooks() [E] Get repository [235] hook tasks: dial tcp 172.xxx.xxx.xxx:3306: connect: cannot assign requested address

In the Gitea logfile I can see that Gitea calls repo_permission.go for every tag and every user of that organisation, which results in a lot of calls and, presumably, a lot of database queries:

2019/10/16 11:12:39 models/update.go:80:PushUpdate() [T] TriggerTask 'TAGNAME' by gituser
2019/10/16 11:12:39 models/pull.go:1088:AddTestPullRequestTask() [T] AddTestPullRequestTask [head_repo_id: xxx, head_branch: xxx]: finding pull requests
2019/10/16 11:12:39 modules/auth/auth.go:97:CheckOAuthAccessToken() [T] ParseOAuth2Token: signature is invalid
2019/10/16 11:12:39 ...s/context/context.go:328:func1() [D] Session ID: xxx
2019/10/16 11:12:39 ...s/context/context.go:329:func1() [D] CSRF Token: xxx
2019/10/16 11:12:39 models/pull.go:1128:AddTestPullRequestTask() [T] AddTestPullRequestTask [base_repo_id: xxx, base_branch: xxx]: finding pull requests
2019/10/16 11:12:40 ...s/repo_permission.go:154:func1() [T] Permission Loaded for 824725727744:USER1 in 824725727824:ORG/reponame
2019/10/16 11:12:40 ...s/repo_permission.go:154:func1() [T] Permission Loaded for 824729407712:USER2 in 824729407808:ORG/reponame
2019/10/16 11:12:40 ...s/repo_permission.go:154:func1() [T] Permission Loaded for 824729900800:USER3 in 824729900880:ORG/reponame
[...]

When migrating that same repository to a "standalone" user account, everything works as expected: all releases get created. In the Gitea logfile I can see that Gitea doesn't call repo_permission.go in this case:

2019/10/16 12:18:27 models/update.go:80:PushUpdate() [T] TriggerTask 'TAG1' by gituser
2019/10/16 12:18:27 models/pull.go:1088:AddTestPullRequestTask() [T] AddTestPullRequestTask [head_repo_id: xxx, head_branch: xxx]: finding pull requests
2019/10/16 12:18:27 modules/auth/auth.go:97:CheckOAuthAccessToken() [T] ParseOAuth2Token: signature is invalid
2019/10/16 12:18:27 ...s/context/context.go:328:func1() [D] Session ID: xxx
2019/10/16 12:18:27 ...s/context/context.go:329:func1() [D] CSRF Token: xxx
2019/10/16 12:18:27 models/pull.go:1128:AddTestPullRequestTask() [T] AddTestPullRequestTask [base_repo_id: xxx, base_branch: xxx]: finding pull requests
2019/10/16 12:18:28 models/webhook.go:898:DeliverHooks() [T] DeliverHooks [repo_id: xxx]
2019/10/16 12:18:28 models/update.go:80:PushUpdate() [T] TriggerTask 'TAG2' by gituser
[...]

We also tried to migrate a very large repository with around 15,000 commits and 7,000 tags. Although we get an error from the git client, everything (tags, commits) gets transmitted and all releases get created by Gitea when migrating to a user repository.

user@linux:~/repo.git$ git push --mirror https://gitea.example.org/ORG/repo.git
Counting objects: 409616, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (108168/108168), done.
Writing objects: 100% (409616/409616), 1.14 GiB | 25.52 MiB/s, done.
Total 409616 (delta 201291), reused 409611 (delta 201286)
error: RPC failed; curl 56 GnuTLS recv error (-110): The TLS connection was non-properly terminated.
fatal: The remote end hung up unexpectedly
error: error in sideband demultiplexer
Everything up-to-date
user@linux:~/repo.git$ git push --mirror https://gitea.example.org/ORG/repo.git
Everything up-to-date

Long story short: when we migrate repositories with more than 50 tags (or far more) to a user account, everything works. When we migrate them to an organisation with a lot of members, Gitea gets stuck while creating releases from the tags.

Workaround

As a workaround, we first migrate to a user account and then transfer ownership of the repository to the organisation.

@guillep2k (Member)

This seems to be related to #8528 (#8273).

I'd recommend changing app.ini to set:

[database]
MAX_IDLE_CONNS = 5
CONN_MAX_LIFE_TIME = 300

(I'd experiment with the values and see if they improve the results)

@zeripath (Contributor)

CONN_MAX_LIFE_TIME = 5m is easier to read.

@lunny lunny added the type/bug label Oct 17, 2019
@oaxiento (Author)

Thank you both for your quick response. I changed the database settings and tried migrating one of our biggest repositories (>6,000 tags, >14,000 commits) to our Gitea organisation. After creating 854 releases and around 100,000 repo_permission.go calls, I see the same error though:

2019/10/17 08:40:54 ...s/repo_permission.go:154:func1() [T] Permission Loaded for 824811255664:SOMEUSER in 824811255760:ORG/repo.git: ...
[...]
2019/10/17 09:06:58 ...s/repo_permission.go:154:func1() [T] Permission Loaded for 824662319392:SOMEUSER in 824662319520:ORG/repo.git: ...
2019/10/17 09:06:58 ...s/repo_permission.go:154:func1() [T] Permission Loaded for 824662792016:SOMEUSER in 824662792224:ORG/repo.git: ...
2019/10/17 09:06:58 ...ters/private/hook.go:165:HookPostReceive() [E] Failed to Update: ORG/repo.git Branch: BRANCHNAME Error: PushUpdate: CommitRepoAction: NotifyWatchers: insert new action: dial tcp 172.xxx.xxx.xxx:3306: connect: cannot assign requested address

I think this is an efficiency issue, as I don't see why there have to be so many repo_permission.go calls when creating releases from tags. When I migrate the same repository to a user account and then move it to the organisation, repo_permission.go only runs once, for the user transferring the repository.

@lunny (Member) commented Oct 17, 2019

@mansshardt Thanks for your reports. Could you share your database settings?

@oaxiento (Author)

Sure! Here they are:

[database]
DB_TYPE = mysql
HOST = mariadb-host.example.org:3306
NAME = gitea
USER = gitea
PASSWD = super-secret-password
SSL_MODE = disable
LOG_SQL = False
MAX_IDLE_CONNS = 5
CONN_MAX_LIFE_TIME = 300

@lunny (Member) commented Oct 17, 2019

300 means 300 nanoseconds, NOT 300 seconds; that's far too short. Could you change that to 5m and try again?
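For reference, Go's time.Duration is measured in nanoseconds, so a bare 300 becomes 300 ns when treated as a duration; whether Gitea's ini parsing takes exactly this path is an assumption, but the contrast is easy to see in plain Go:

package main

import (
	"fmt"
	"time"
)

func main() {
	fmt.Println(time.Duration(300))  // prints "300ns" – what a bare 300 means as a time.Duration
	d, _ := time.ParseDuration("5m") // the suggested value
	fmt.Println(d)                   // prints "5m0s"
}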

@guillep2k (Member)

Sorry, my fault about the nanoseconds.

@zeripath (Contributor)

My fault too. I should have checked.

@oaxiento (Author)

No problem! I changed the setting to CONN_MAX_LIFE_TIME = 5m and restarted Gitea. Sadly, this didn't make a difference. After creating 766 releases I still get:

2019/10/17 15:14:30 ...ters/private/hook.go:165:HookPostReceive() [E] Failed to Update: ORG/repo.git Branch: BRANCHNAME Error: PushUpdate: CommitRepoAction: NotifyWatchers: insert new action: dial tcp 172.xxx.xxx.xxx:3306: connect: cannot assign requested address

@guillep2k (Member) commented Oct 17, 2019

172.xxx.xxx.xxx:3306: connect: cannot assign requested address

This is most likely because of attempting too many connections to MariaDB, but I thought this would be solved by the new settings.

Since TCP can only create and destroy fewer than 64,515 socket pairs within a short window (because of TIME_WAIT), and this import operation seems to be exhausting that limit, perhaps you could try using UNIX sockets for your connection to MariaDB instead?
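For reference, if the database did live on the same host, a socket connection could be configured in app.ini roughly like this (as far as I know Gitea accepts an absolute socket path in HOST for MySQL; the path below is only an example and depends on your MariaDB setup):

[database]
DB_TYPE = mysql
HOST = /var/run/mysqld/mysqld.sock
NAME = gitea
USER = gitea
PASSWD = super-secret-password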

@oaxiento (Author) commented Oct 17, 2019

I just pushed (--mirror) again and checked the connections to the database via netstat -an | grep ':3306' | wc -l. During the creation of releases I have around 25,000 TCP sockets in TIME_WAIT. As mentioned before, I'd guess that every permission call results in one database connection that doesn't get reused.

I don't understand why so many connections are necessary in the first place. In my opinion, fiddling with MariaDB properties or database settings to mitigate the issue shouldn't be the solution. I also can't use UNIX sockets, as the database is not on the same machine.

@guillep2k (Member)

@mansshardt I agree that this should not be the solution. I was only offering alternatives. 😄

For some reason MAX_IDLE_CONNS didn't have any effect. AFAIU it should have had the effect of reusing the connections for most transactions, especially after CONN_MAX_LIFE_TIME = 5m.

@zeripath (Contributor)

You probably need my PR, which allows you to set MAX_OPEN_CONNS to prevent too many open connections to the DB.

@guillep2k (Member)

@zeripath If I'm not mistaken, the problem in this issue is the number of closed connections, not the open ones.

@oaxiento (Author)

I agree that this should not be the solution. I was only offering alternatives. 😄
For some reason MAX_IDLE_CONNS didn't have any effect. AFAIU it should have had the effect of reusing the connections for most transactions. Especially after CONN_MAX_LIFE_TIME = 5m.

@guillep2k Sure, and thanks again for your input.

The issue is not that there are too many open connections to the database, as @guillep2k mentioned. The problem is that, during the creation of releases from tags, a huge number of connections get established and quickly closed. This doesn't seem very efficient. As the system holds a TCP socket in TIME_WAIT for some time, we end up with a lot of these (~25,000) during the creation of releases from tags.

It seems that MAX_IDLE_CONNS and CONN_MAX_LIFE_TIME = 5m don't have any effect. I would expect these settings to build some kind of connection pool, which they don't.

I think two elements are relevant in this issue:

  1. The permission concept could be more efficient, so that pushing tags to an organisation with a lot of members doesn't result in so many database queries. The question is: why does the permission routine run for every release that gets created from a tag?
  2. Gitea should hold a (configurable) connection pool so that database connections can get reused.

@guillep2k (Member)

I've checked the docs and the sources and I couldn't find a reason why the connections are not pooled.

https://www.alexedwards.net/blog/configuring-sqldb


@zeripath It looks like your PR #8528 could be related after all, but the default value should be working nonetheless. In theory, by not calling SetMaxOpenConns() we're effectively using MaxOpenConns == 0, which should allow for any value in SetMaxIdleConns(). But that somehow is not working.
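For context, here is a minimal sketch of the three pool knobs from Go's database/sql that are being discussed, assuming a plain *sql.DB as in the linked article; this is illustrative only, not Gitea's actual setup code, and the DSN is a placeholder:

package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "gitea:password@tcp(db-host:3306)/gitea")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	db.SetMaxOpenConns(5)                  // 0 (the default) means unlimited open connections
	db.SetMaxIdleConns(5)                  // connections kept idle for reuse instead of being closed
	db.SetConnMaxLifetime(5 * time.Minute) // recycle connections after five minutes
}

Note that database/sql defaults to only 2 idle connections and no cap on open ones, so a burst of concurrent queries can still open and immediately close far more connections than ever get reused, which would match the TIME_WAIT build-up seen here.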

@zeripath (Contributor)

@guillep2k Yeah, so a long enough lifetime, a large enough MaxOpenConns, and a MaxIdleConns close to MaxOpenConns should prevent the rapid opening and closing of connections, avoiding the port-number depletion and at least papering over this implementation fault.

We need to think about these permissions calls a bit better and consider if we can cache these results in some way.

In this particular case if we look at cmd/hook.go:

gitea/cmd/hook.go

Lines 124 to 186 in 280f4be

func runHookPostReceive(c *cli.Context) error {
	if len(os.Getenv("SSH_ORIGINAL_COMMAND")) == 0 {
		return nil
	}

	setup("hooks/post-receive.log")

	// the environment setted on serv command
	repoUser := os.Getenv(models.EnvRepoUsername)
	isWiki := (os.Getenv(models.EnvRepoIsWiki) == "true")
	repoName := os.Getenv(models.EnvRepoName)
	pusherID, _ := strconv.ParseInt(os.Getenv(models.EnvPusherID), 10, 64)
	pusherName := os.Getenv(models.EnvPusherName)

	buf := bytes.NewBuffer(nil)
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		buf.Write(scanner.Bytes())
		buf.WriteByte('\n')

		// TODO: support news feeds for wiki
		if isWiki {
			continue
		}

		fields := bytes.Fields(scanner.Bytes())
		if len(fields) != 3 {
			continue
		}

		oldCommitID := string(fields[0])
		newCommitID := string(fields[1])
		refFullName := string(fields[2])

		res, err := private.HookPostReceive(repoUser, repoName, private.HookOptions{
			OldCommitID: oldCommitID,
			NewCommitID: newCommitID,
			RefFullName: refFullName,
			UserID:      pusherID,
			UserName:    pusherName,
		})
		if res == nil {
			fail("Internal Server Error", err)
		}

		if res["message"] == false {
			continue
		}

		fmt.Fprintln(os.Stderr, "")
		if res["create"] == true {
			fmt.Fprintf(os.Stderr, "Create a new pull request for '%s':\n", res["branch"])
			fmt.Fprintf(os.Stderr, "  %s\n", res["url"])
		} else {
			fmt.Fprint(os.Stderr, "Visit the existing pull request:\n")
			fmt.Fprintf(os.Stderr, "  %s\n", res["url"])
		}
		fmt.Fprintln(os.Stderr, "")
	}

	return nil
}

Within the context of the hook we read each line of the provided stdin. We get one line per updated ref and they are of the form:

<old-value> SP <new-value> SP <ref-name> LF
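For example, a mirror push that updates one branch and two tags would put three such lines on the hook's stdin (the SHAs and ref names here are made up), so a push with thousands of tags produces thousands of lines:

0000000000000000000000000000000000000000 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b refs/heads/master
0000000000000000000000000000000000000000 2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c refs/tags/v1.0.0
0000000000000000000000000000000000000000 3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d refs/tags/v1.0.1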

This then gets translated to a GET request to the Gitea server calling:

// HookPostReceive updates services and users
func HookPostReceive(ctx *macaron.Context) {
	ownerName := ctx.Params(":owner")
	repoName := ctx.Params(":repo")
	oldCommitID := ctx.Query("old")
	newCommitID := ctx.Query("new")
	refFullName := ctx.Query("ref")
	userID := ctx.QueryInt64("userID")
	userName := ctx.Query("username")

	branch := refFullName
	if strings.HasPrefix(refFullName, git.BranchPrefix) {
		branch = strings.TrimPrefix(refFullName, git.BranchPrefix)
	} else if strings.HasPrefix(refFullName, git.TagPrefix) {
		branch = strings.TrimPrefix(refFullName, git.TagPrefix)
	}

	// Only trigger activity updates for changes to branches or
	// tags. Updates to other refs (eg, refs/notes, refs/changes,
	// or other less-standard refs spaces are ignored since there
	// may be a very large number of them).
	if strings.HasPrefix(refFullName, git.BranchPrefix) || strings.HasPrefix(refFullName, git.TagPrefix) {
		repo, err := models.GetRepositoryByOwnerAndName(ownerName, repoName)
		if err != nil {
			log.Error("Failed to get repository: %s/%s Error: %v", ownerName, repoName, err)
			ctx.JSON(http.StatusInternalServerError, map[string]interface{}{
				"err": fmt.Sprintf("Failed to get repository: %s/%s Error: %v", ownerName, repoName, err),
			})
			return
		}

		if err := repofiles.PushUpdate(repo, branch, models.PushUpdateOptions{
			RefFullName:  refFullName,
			OldCommitID:  oldCommitID,
			NewCommitID:  newCommitID,
			PusherID:     userID,
			PusherName:   userName,
			RepoUserName: ownerName,
			RepoName:     repoName,
		}); err != nil {
			log.Error("Failed to Update: %s/%s Branch: %s Error: %v", ownerName, repoName, branch, err)
			ctx.JSON(http.StatusInternalServerError, map[string]interface{}{
				"err": fmt.Sprintf("Failed to Update: %s/%s Branch: %s Error: %v", ownerName, repoName, branch, err),
			})
			return
		}
	}

	if newCommitID != git.EmptySHA && strings.HasPrefix(refFullName, git.BranchPrefix) {
		repo, err := models.GetRepositoryByOwnerAndName(ownerName, repoName)
		if err != nil {
			log.Error("Failed to get repository: %s/%s Error: %v", ownerName, repoName, err)
			ctx.JSON(http.StatusInternalServerError, map[string]interface{}{
				"err": fmt.Sprintf("Failed to get repository: %s/%s Error: %v", ownerName, repoName, err),
			})
			return
		}
		repo.OwnerName = ownerName

		pullRequestAllowed := repo.AllowsPulls()
		if !pullRequestAllowed {
			ctx.JSON(http.StatusOK, map[string]interface{}{
				"message": false,
			})
			return
		}

		baseRepo := repo
		if repo.IsFork {
			if err := repo.GetBaseRepo(); err != nil {
				log.Error("Failed to get Base Repository of Forked repository: %-v Error: %v", repo, err)
				ctx.JSON(http.StatusInternalServerError, map[string]interface{}{
					"err": fmt.Sprintf("Failed to get Base Repository of Forked repository: %-v Error: %v", repo, err),
				})
				return
			}
			baseRepo = repo.BaseRepo
		}

		if !repo.IsFork && branch == baseRepo.DefaultBranch {
			ctx.JSON(http.StatusOK, map[string]interface{}{
				"message": false,
			})
			return
		}

		pr, err := models.GetUnmergedPullRequest(repo.ID, baseRepo.ID, branch, baseRepo.DefaultBranch)
		if err != nil && !models.IsErrPullRequestNotExist(err) {
			log.Error("Failed to get active PR in: %-v Branch: %s to: %-v Branch: %s Error: %v", repo, branch, baseRepo, baseRepo.DefaultBranch, err)
			ctx.JSON(http.StatusInternalServerError, map[string]interface{}{
				"err": fmt.Sprintf(
					"Failed to get active PR in: %-v Branch: %s to: %-v Branch: %s Error: %v", repo, branch, baseRepo, baseRepo.DefaultBranch, err),
			})
			return
		}

		if pr == nil {
			if repo.IsFork {
				branch = fmt.Sprintf("%s:%s", repo.OwnerName, branch)
			}
			ctx.JSON(http.StatusOK, map[string]interface{}{
				"message": true,
				"create":  true,
				"branch":  branch,
				"url":     fmt.Sprintf("%s/compare/%s...%s", baseRepo.HTMLURL(), util.PathEscapeSegments(baseRepo.DefaultBranch), util.PathEscapeSegments(branch)),
			})
		} else {
			ctx.JSON(http.StatusOK, map[string]interface{}{
				"message": true,
				"create":  false,
				"branch":  branch,
				"url":     fmt.Sprintf("%s/pulls/%d", baseRepo.HTMLURL(), pr.Index),
			})
		}
		return
	}
	ctx.JSON(http.StatusOK, map[string]interface{}{
		"message": false,
	})
}

This has the benefit that each commit SHA is logged for free, but if you're updating a lot of refs it means you get a lot of separate HTTP requests.

Pre-receive has a similar architecture.

Now, that architecture means that even if we were doing this within a single session we wouldn't get much benefit from session caching - although it might have some benefit.

A better architecture would be to pass all of the refs in a POST, we could then create a repofiles.PushUpdates which could have all the updated refs.

Unfortunately, when I made these changes to the hooks I considered but dismissed the idea that anyone would be likely to send almost a thousand updates in one push, so in terms of doing the least work I only made the simplest implementation.
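For illustration, a batched version could collect every scanned ref into one payload and send a single POST; a sketch of the shape (the names here are hypothetical and do not claim to match the eventual #8602 code) might be:

package private

// HookRefUpdate describes one pushed ref, mirroring the
// "<old-value> SP <new-value> SP <ref-name>" lines read from stdin.
type HookRefUpdate struct {
	OldCommitID string
	NewCommitID string
	RefFullName string
}

// HookPostReceiveBatch carries every ref from a single push, so the server
// could load the repository and permissions once per push rather than once per ref.
type HookPostReceiveBatch struct {
	UserID   int64
	UserName string
	Updates  []HookRefUpdate
}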

@guillep2k (Member)

A better architecture would be to pass all of the refs in a POST, we could then create a repofiles.PushUpdates which could have all the updated refs.

Although optimization is always a good thing, I think the root of the problem here is the connection pool. It will bite us again sooner or later, not only with the migration of large repositories.

@zeripath (Contributor)

Yeah, at the end of the day it doesn't matter how efficient this bit of code is: if you haven't configured the pool properly, the right kind of load can still run it out of connections.

At least with #8528 we will expose all the configurables that Go provides to the user; if that's still not enough, then we'll have to think about writing our own pool (one which could at least handle this error and wait).

If MaxOpenConns and MaxIdleConns are equal, then there should be at most MSL * MaxOpenConns / MaxLifetime connections in TIME_WAIT. If you change MaxIdleConns to be different from MaxOpenConns, you're likely to need to increase the max lifetime, but there will be a point at which there is no stable solution.

Without setting MaxOpenConns a sufficient load will cause port exhaustion.
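For a rough sense of scale (treating MSL here as the roughly 60-second TIME_WAIT interval on a typical Linux box, which is an assumption): with MaxOpenConns = MaxIdleConns = 5 and MaxLifetime = 5m, the bound above works out to about 60 s × 5 / 300 s = 1 connection sitting in TIME_WAIT at any moment, versus the ~25,000 observed when every query opens and closes its own connection.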

@guillep2k (Member)

@zeripath Mmm... I was about to write a long explanation of how MaxOpenConns should not affect the number of closed connections, but now I think I see your point. The only way to keep the system from creating TIME_WAIT entries is to avoid closing connections as much as possible, so MaxIdleConns should be equal to MaxOpenConns in this type of application, where many users can be doing operations at the same time.

Again, your PR seems on point. What I wonder is: what's the strategy in the database driver for scheduling the connection requests when they are all busy? FIFO? Are there others available?

@zeripath (Contributor)

Without looking at the code I would guess it's actually "random" - the most obvious implementation is a simple spin lock with a wait until you actually get the real lock. I would bet there is no formal queue - too expensive - so we're into OS level queuing algorithms, suggesting a likely bias towards LIFO.

@guillep2k (Member)

@mansshardt I know it's a bit of a hassle, but is it possible for you to grab @zeripath's PR, build from source, and try the following?

[database]
; ...
MAX_OPEN_CONNS = 5
MAX_IDLE_CONNS = 5
CONN_MAX_LIFE_TIME = 5m

(It's important for the test that the first two values match)

@oaxiento (Author)

@guillep2k I will try that on Monday when I am back at the office and get back to you.

@oaxiento (Author)

I just had the chance to test with a build from @zeripath's PR and the following database settings:

[database]
; ...
MAX_OPEN_CONNS = 5
MAX_IDLE_CONNS = 5
CONN_MAX_LIFE_TIME = 5m

With this build and these settings I can see proper database pooling. During migration I have two or three TCP sockets in the ESTABLISHED state, which get used randomly. After five minutes the sockets go into TIME_WAIT and then get removed, as expected. With this build the migration of huge repos with a lot of tags works well, even though it's pretty slow; all releases get created properly. I think #8602 should improve the speed issue. I hope to see both fixes/improvements in an official release soon.

@zeripath (Contributor)

@mansshardt would you be able to try #8602? It would be a great test of the code if it worked.

@oaxiento (Author)

@mansshardt would you be able to try #8602? It would be a great test of the code if it worked.

@zeripath With my build from #8602, git doesn't time out anymore, but just 190 of over 6,000 tags got created as releases. Gitea didn't throw any error.

@zeripath (Contributor)

Damn that means that I have a bug...

@zeripath (Contributor)

OK @mansshardt, I think that 200 is too large a batch size for Gitea to process without the internal request timing out, and that's why you only get ~200 processed.

@stale (bot) commented Dec 20, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs during the next 2 weeks. Thank you for your contributions.

@stale stale bot added the issue/stale label Dec 20, 2019
@zeripath zeripath added the issue/confirmed Issue has been reviewed and confirmed to be present or accepted to be implemented label Dec 20, 2019
@stale stale bot removed the issue/stale label Dec 20, 2019
@go-gitea go-gitea locked and limited conversation to collaborators Nov 24, 2020