Dgraph v20.03.0 Regexp Not returning expected results #5102

Closed
darkn3rd opened this issue Apr 3, 2020 · 5 comments · Fixed by #5255
Labels
area/querylang/function priority/P0 Critical issue that requires immediate attention. status/accepted We accept to investigate/work on it.

Comments

@darkn3rd
Contributor

darkn3rd commented Apr 3, 2020

SUMMARY

Using a query like regexp(name@en, /.*alien.*/i) returns an empty set. This was discovered in the Dgraph tour: https://dgraph.io/tour/search/3/#

ENVIRONMENT

  • Version: docker image dgraph/dgraph:v20.03.0
  • System: Docker Desktop v2.2.0.5 6GB running on host Mac OS 10.14.6

STEPS

STEP 1 - Env

docker-compose up -d

Compose File

version: "3.2"
services:
  zero:
    image: dgraph/dgraph:${DGRAPH_VERSION}
    volumes:
      - /tmp/data:/dgraph
    ports:
      - 5080:5080
      - 6080:6080
    restart: on-failure
    command: dgraph zero --my=zero:5080
  alpha:
    image: dgraph/dgraph:${DGRAPH_VERSION}
    volumes:
      - /tmp/data:/dgraph
    ports:
      - 8080:8080
      - 9080:9080
    restart: on-failure
    command: dgraph alpha --my=alpha:7080 --lru_mb=2048 --zero=zero:5080
  ratel:
    image: dgraph/dgraph:${DGRAPH_VERSION}
    ports:
      - 8000:8000
    command: dgraph-ratel

STEP 2 - Schema

Apply the Schema in:

STEP 3 - Load Data

wget "https://github.com/dgraph-io/tutorial/blob/master/resources/1million.rdf.gz?raw=true" -O 1million.rdf.gz -q
docker cp ./1million.rdf.gz dgraph_zero_1:/tmp
docker exec -it dgraph_zero_1 dgraph live -f /tmp/1million.rdf.gz --alpha alpha:9080 --zero zero:5080 -c 1

STEP 4 - Indexes

Apply Indexes from: https://dgraph.io/tour/search/1/

name: string @index(term, fulltext, trigram) @lang .

STEP 5 - Run query

From: https://dgraph.io/tour/search/3/#

{
  aliens(func: regexp(name@en, /.*alien.*/i)) @cascade {
    name@en
    ~genre {
      name@en
      starring {
        performance.actor @filter(regexp(name@en,
          /(.*ali.*e.*n.*)|(.*a.*lie.*n.*)|(.*a.*l.*ien.*)/i)) {
          name@en
        }
      }
    }
  }
}

EXPECTED BEHAVIOR

{
  "data": {
    "aliens": [
      {
        "name@en": "Alien Film",
        "~genre": [
          {
            "name@en": "Mars Attacks!",
            "starring": [
              {
                "performance.actor": [
                  {
                    "name@en": "Natalie Portman"
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
/* ... */
}

ACTUAL BEHAVIOR

{
  "data": {
    "aliens": []
  }
/* ... */
}

WORKAROUND

I modified the query and this worked:

{
  aliens(func: regexp(name@en, /.*ali.*e.*n.*/i)) @cascade {
    name@en
    ~genre {
      name@en
      starring {
        performance.actor @filter(regexp(name@en,
          /(.*ali.*e.*n.*)|(.*a.*lie.*n.*)|(.*a.*l.*ien.*)/i)) {
          name@en
        }
      }
    }
  }
}
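
As a sanity check outside Dgraph, the original and workaround patterns can be compared with Go's standard regexp package. This is only a sketch: the helper name matches is made up for illustration, and "Alien Film" is the genre name taken from the expected output above. Both patterns match that name, which suggests the empty result does not come from the regular expression itself.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches is a hypothetical helper (not part of Dgraph). It reports whether
// name matches pattern case-insensitively. The "(?i)" prefix is Go's inline
// flag for case-insensitive matching, standing in for the trailing /i used
// in the Dgraph query.
func matches(pattern, name string) bool {
	return regexp.MustCompile("(?i)" + pattern).MatchString(name)
}

func main() {
	name := "Alien Film"

	// Pattern from the tour query that returned an empty set.
	fmt.Println(matches(`.*alien.*`, name))

	// Pattern from the workaround query.
	fmt.Println(matches(`.*ali.*e.*n.*`, name))
}
```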
@darkn3rd darkn3rd changed the title from "Regexp Not returning expected results" to "Dgraph v20.03.0 Regexp Not returning expected results" Apr 3, 2020
@lgalatin
Contributor

Issue seems related to: #5131

@martinmr
Contributor

I was able to reproduce this.

I noticed that the first time I ran the query, I got the expected result. I ran it again after a little while and then I got an empty result. The logs show that rollups happened in between, so I suspect they had something to do with this issue.

The second query is still working even after more rollups so I am not clear why the issue is only affecting some queries and not others.

I checked the p directory and none of the indices have split keys so this doesn't seem to be related to that. It also doesn't seem to be related to incremental rollups because #5131 is a duplicate of this issue and the reporter of that issue is using a build from December 2019.

So far it looks like there's some issue with rollups + trigram indexes. Incremental rollups might have caused the issue to bubble up more often but the issue appears to have been there for a while. I'll keep debugging.

@martinmr martinmr added priority/P0 Critical issue that requires immediate attention. status/accepted We accept to investigate/work on it. labels Apr 20, 2020
@martinmr
Contributor

martinmr commented Apr 20, 2020

More debugging:

  1. The first query works. The list of trigrams considered for it is below.
  2. This query causes some rollups. After the rollup, the same query does not consider the same number of tokens, so it no longer works.
  3. The query that still works probably does not depend on the missing trigrams, so it is unaffected.

I will look into how this list of trigrams is generated.

// First query
 16 trigram.go:58] tokens for trigram query [ALI ALi AlI Ali aLI aLi alI ali]
alpha1    | I0420 19:52:54.871338      16 trigram.go:58] tokens for trigram query [LIE LIe LiE Lie lIE lIe liE lie]
alpha1    | I0420 19:52:54.874473      16 trigram.go:58] tokens for trigram query [IEN IEn IeN Ien iEN iEn ieN ien]
a

// Second query. Note that the first list is empty now.
alpha1    | I0420 19:53:21.075204      16 trigram.go:58] tokens for trigram query []
alpha1    | I0420 19:53:21.075254      16 trigram.go:58] tokens for trigram query [ALI ALi AlI Ali aLI aLi alI ali]
alpha1    | I0420 19:53:21.077615      16 trigram.go:58] tokens for trigram query [LIE LIe LiE Lie lIE lIe liE lie]

@martinmr
Contributor

Found the location of the bug. In List.Uids there is what appears to be an optimization.

	if len(l.mutationMap) == 0 && opt.Intersect != nil && len(l.plist.Splits) == 0 {
		if opt.ReadTs < l.minTs {
			l.RUnlock()
			return out, ErrTsTooOld
		}
		algo.IntersectCompressedWith(l.plist.Pack, opt.AfterUid, opt.Intersect, out)
		l.RUnlock()
		return out, nil
	}

After the rollup, this branch is taken (because the mutable layer is empty after a rollup). There is a bug in IntersectCompressedWith. This is the only place where this function is called in the codebase. Commenting out this code and disabling the optimization fixes the queries.

Not sure if it's worth it to try to fix the bug in this method or simply remove the optimization. There are actually three versions of the intersection algorithm (linear, with linear jumps, and with binary search). I am still not sure which of the three has the bug but I can at least confirm that the rollup is not creating invalid data.

@danielmai
Contributor

IntersectCompressedWith has been around for a long time. Do we know what exactly caused this issue?

@sleto-it sleto-it added this to the Dgraph v20.03.1 milestone Apr 28, 2020