Full re-index of solr data on prod #1067

Closed
8 of 14 tasks
cdrini opened this issue Sep 5, 2018 · 14 comments
Assignees
Labels
Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] Module: Docker Issues related to the configuration or use of Docker. [managed] Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] Priority: 2 Important, as time permits. [managed] Type: Epic A feature or refactor that is big enough to require subissues. [managed] Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed]

Comments

@cdrini
Collaborator

cdrini commented Sep 5, 2018

This will be an important step toward a more reliable solr environment. Being able to create an identical solr environment locally will get rid of a lot of confusion. It would also give us a path forward on #178 and #599, since we can spin up a new solr, re-index it with the new settings, and then swap it with the old solr without any downtime.

Subtasks

  • Create docker image for solr (Solr Docker image, #1055)
  • Determine data on production solr
    • Why are there type: subject? This looks like it's used for /search/subjects, so these need to be included.
    • Why are there type: edition? This looks like residuals of dead code for /search/editions (which does appear to work for the measly ~3.5K editions stored in solr)
    • Why isn't there any stats related data? /solr/process_stats.py looks like dead code.
    • Ensure dev's config file is the same as prod's. -> Copied from prod into solrbuilder, so they will be identical.
  • Create test solr on server.openjournal.foundation (Reindex documents into a new Solr on OJF, #2222)
  • Create Docker-based solr for production use
    • Create solr environment on prod somewhere
    • Pause both solrupdaters
    • Copy OJF solr data to new prod environment
    • Link production to new solr endpoint
    • Destroy old solr endpoint

Notes/Comments

  • I believe solr is storing viewage statistics as well as just works/authors themselves
    • @mekarpeles Can you run this query on production solr: NOT(type:work) AND NOT(type:author)?
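For anyone wanting to run the query above themselves, here is a minimal sketch of how the request URL could be built. The endpoint `http://localhost:8983/solr/select` and the helper name `build_other_types_query` are assumptions for illustration; the real production Solr URL is not given in this issue. Faceting on `type` is added so the response also breaks the matches down by type.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the actual Solr select handler URL.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_other_types_query(base_url: str = SOLR_SELECT_URL) -> str:
    """Build a select URL matching documents that are neither works nor authors."""
    params = {
        "q": "NOT(type:work) AND NOT(type:author)",  # query quoted from this issue
        "rows": 0,              # only counts are wanted, not the documents
        "facet": "true",
        "facet.field": "type",  # count the matches per type
        "wt": "json",
    }
    return f"{base_url}?{urlencode(params)}"


url = build_other_types_query("http://example.org/solr/select")
print(url)
```

The per-type counts would then appear under `facet_counts` in the JSON response.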
@cdrini cdrini added the Module: Docker Issues related to the configuration or use of Docker. [managed] label Sep 5, 2018
@cdrini cdrini self-assigned this Sep 5, 2018
@tfmorris tfmorris added the Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] label Dec 27, 2018
@cdrini
Collaborator Author

cdrini commented Feb 16, 2019

These are the other types on production solr. Need to investigate why they're there/if they should be there. Also need to investigate why there wasn't any stats data there (as I previously thought).

type: subject (1510685)
Sample:

<doc>
  <str name="key">/subjects/org:conseil_national_économique_(france)</str>
  <str name="name">Conseil national économique (France)</str>
  <str name="subject_type">org</str>
  <arr name="text">
    <str>Conseil national économique (France)</str>
    <str>/subjects/org:conseil_national_économique_(france)</str>
  </arr>
  <str name="type">subject</str>
  <int name="work_count">1</int>
</doc>

type: edition (3419)
Sample:

<doc>
  <arr name="author_key">
    <str>OL6941607A</str>
  </arr>
  <arr name="author_name">
    <str>Carlos Arturo Jiménez</str>
  </arr>
  <bool name="has_fulltext">false</bool>
  <str name="key">/books/OL25648663M</str>
  <int name="last_modified_i">1419832732</int>
  <arr name="seed">
    <str>/books/OL25648663M</str>
    <str>/works/OL15935579W</str>
    <str>/subjects/politics_and_government</str>
    <str>/subjects/presidents</str>
    <str>/subjects/frente_sandinista_de_liberación_nacional</str>
    <str>/subjects/assassination_attempts</str>
    <str>/subjects/person:daniel_ortega</str>
    <str>/subjects/person:carlos_arturo_jiménez</str>
    <str>/subjects/place:nicaragua</str>
    <str>/subjects/time:1979-1990</str>
    <str>/authors/OL6941607A</str>
  </arr>
  <arr name="text">
    <str>Nosotros no le decíamos presidente</str>
    <str>Carlos Arturo Jiménez</str>
    <str>/books/OL25648663M</str>
    <str>OL6941607A</str>
  </arr>
  <str name="title">Nosotros no le decíamos presidente</str>
  <str name="title_suggest">Nosotros no le decíamos presidente</str>
  <str name="type">edition</str>
</doc>
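When auditing these leftover types, it can help to load the XML samples into plain Python values. The following is a rough sketch using only the standard library; `parse_solr_doc` and `_convert` are hypothetical helper names, and the sample is the `type: subject` document from above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<doc>
  <str name="key">/subjects/org:conseil_national_économique_(france)</str>
  <str name="name">Conseil national économique (France)</str>
  <str name="subject_type">org</str>
  <arr name="text">
    <str>Conseil national économique (France)</str>
    <str>/subjects/org:conseil_national_économique_(france)</str>
  </arr>
  <str name="type">subject</str>
  <int name="work_count">1</int>
</doc>"""


def _convert(el):
    """Convert one typed Solr XML value element to a Python value."""
    if el.tag == "arr":
        return [_convert(child) for child in el]
    text = el.text or ""
    if el.tag in ("int", "long"):
        return int(text)
    if el.tag == "bool":
        return text == "true"
    return text  # <str> and any other tag are kept as text


def parse_solr_doc(xml_text):
    """Map each named field of a Solr <doc> element to its converted value."""
    root = ET.fromstring(xml_text)
    return {field.get("name"): _convert(field) for field in root}


doc = parse_solr_doc(SAMPLE)
print(doc["type"], doc["work_count"], len(doc["text"]))
```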

@cdrini cdrini added State: Work In Progress This issue is being actively worked on. [managed] Type: Feature labels Apr 21, 2019
@brad2014 brad2014 added Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed] and removed Type: Feature labels Apr 23, 2019
@cdrini cdrini added the Type: Epic A feature or refactor that is big enough to require subissues. [managed] label Jul 19, 2019
@tfmorris
Contributor

There is a set of official Docker images for Solr that we may want to consider using:
https://hub.docker.com/_/solr

@cdrini
Collaborator Author

cdrini commented Jul 22, 2019

Unfortunately none of them support our current version of solr :/

@tfmorris
Contributor

That's because Solr 3.6 is so ancient it hasn't been supported for years. Given that Solr can only read indexes from one major release back before it requires a complete reindex, and we're planning a reindex anyway, it seems like the perfect opportunity to upgrade to a more modern (and supported) version.

As far as I know we have a pretty vanilla installation and schema and don't make use of any exotic features which are likely to be version dependent. The current supported Solr releases are 7.7 and 8.1.
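As a small worked example of the compatibility rule described above, the sketch below encodes "an index can be read by the same or the next major release only" and applies it to an index written by Solr 3.x. The function name `needs_full_reindex` is made up for illustration, and the rule is taken from the comment rather than checked against every release.

```python
def needs_full_reindex(index_major: int, target_major: int) -> bool:
    """Return True when the target Solr cannot read the existing index directly.

    Assumes the rule stated in the comment: a release reads indexes written
    by itself or by the immediately preceding major release only.
    """
    if target_major < index_major:
        raise ValueError("downgrades are not covered by this sketch")
    return target_major - index_major > 1


# Starting from an index written by Solr 3.x:
for target in (4, 7, 8):
    print(target, needs_full_reindex(3, target))
```

Under that assumption, moving from 3.6 to either 7.7 or 8.1 needs a full reindex regardless, which is the basis for suggesting that the upgrade be bundled with the planned reindex.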

@cdrini
Collaborator Author

cdrini commented Jul 22, 2019

As far as I know we have a pretty vanilla installation...

Are you willing to bet on that assumption, though? :P

Doing them together increases the risk that the reindex will have a bug and be unusable. I want to switch openlibrary to the reindex as soon as possible so that we can resolve a lot of those outdated index issues we've been having. The next step is updating the schema to better support diacritics/etc. After that comes updating the solr version (which would require an audit of everywhere the solr API is used in our code to make sure the APIs in the latest version are still the same).

The full reindex is mostly automated, so takes an ~fixed amount of time. Adding new features will take developer time (which is more valuable) and has more uncertainty about how long it will take to add/guarantee those features.

@tfmorris
Contributor

As far as I know we have a pretty vanilla installation...

Are you willing to bet on that assumption, though? :P

I'm certainly willing to test the hypothesis. Based on my review of the 5 (!) major version upgrade notes and spot-checking the upgrade notes for dozens of point releases in between, I judge the risk to be small. Facets are probably the most volatile API-visible feature, but even there I didn't see anything that should impact us. A lot of the changes affect clusters, replication, and other features that we don't use.

Another advantage of using a more modern version is that we get to take advantage of 7 years of performance improvements.

I want to switch openlibrary to the reindex as soon as possible so that we can resolve a lot of those outdated index issues we've been having.

Fixing the search infrastructure is a high priority, but it's valuable to keep the historical perspective in mind. Many of these problems have existed for 5+ years. Another few weeks isn't going to make or break users' perceptions of search quality on OpenLibrary.

The full reindex is mostly automated, so takes an ~fixed amount of time.

It needs to be fully automated and as lightweight as possible (preferably network independent) with no private side channel information required so that we can iterate on search improvements.

Adding new features will take developer time

True, but we've already invested the time for the main features that we want. Testing time is also significant and the more iterations we break this into, the greater the testing time required.

BTW, I'm not trying to talk anyone else into testing this. I'm happy to roll it into my testing and performance improvements. It may make sense to defer a decision until we have more supporting (or not) data.

@cdrini
Collaborator Author

cdrini commented Jul 22, 2019

I still think lumping everything together is risky. Right now, we have 2 big changes: a full reindex (with lots of new code), and switching our production env to use docker (lots of room for strange errors). Hooking this up to production is crucial to fully testing this. This is essentially a refactor–we want to maintain the ~same functionality, but with changes to how the code/env works. The more changes we pile on, the harder it will be to know what is causing a bug if a bug appears.

To ~quote Martin Fowler:

How to refactor without doing more harm than good:

  • Don't add functionality at the same time.
  • Make sure your code has tests before refactoring. Run the tests frequently so you know quickly if your changes have broken something.
  • Take short, deliberate steps. Refactoring often involves making many localized changes that result in a larger-scale change. If you keep your steps small, and test after each step, you will avoid prolonged debugging.

So doing this in 3 stages has the benefits of:

  • Lower risk of bugs since fewer changes
  • Easier to debug since fewer changes
  • Gets improvements out faster
  • User experience improvements increase developer morale
  • Less likely to get blocked since there is less uncertainty in a smaller set of requirements

Doing this in 1 stage:

  • Have to perform only 1 instead of 3 full reindices
  • Larger but later site improvement
  • Less possible overlap (making changes which are no longer relevant with a different version of solr)

So I'm convinced that 3 stages is better ¯\_(ツ)_/¯

@cdrini cdrini added the Priority: 2 Important, as time permits. [managed] label Nov 27, 2019
@LeadSongDog

LeadSongDog commented Dec 9, 2019

Gall's Law:
A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system.

In other words, baby steps, please

@mekarpeles mekarpeles added the Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] label Dec 18, 2019
@cdrini cdrini modified the milestone: Next Sprint (Proposed) Feb 3, 2020
@cdrini cdrini changed the title from "Switch production solr to use Docker" to "Full re-index of solr data on prod" Mar 2, 2020
@cdrini cdrini added this to the Next Sprint (Proposed) milestone Mar 2, 2020
@tfmorris
Contributor

tfmorris commented Mar 4, 2020

I reported on the results of my Solr 8.1 experiments many months ago but didn't update this issue, so to close the loop, re:

I'm happy to roll it into my testing and performance improvements. It may make sense to defer a decision until we have more supporting (or not) data.

#2246 includes all the necessary (very minimal) schema updates to support a modern Solr as well as the multicore changes required since there's no such thing as single core Solr any more. The commits should be easily identifiable from the commit messages, but I'm happy to break them out into a separate branch if that makes things easier.

@tfmorris
Contributor

tfmorris commented Mar 4, 2020

So I'm convinced that 3 stages is better ¯\_(ツ)_/¯

This opinion is 9 months old, so hopefully it has changed, but I think a key factor which might be being overlooked is the testing cycle. Even the "minimal" reindex is a complete reboot which will require extensive human testing to confirm that things are working as expected. It's very likely that bug fixes will, themselves, require additional complete rebuilds. Given this, I think it makes sense to bundle a reasonable amount of functionality into these heavyweight rebuilds.

@cdrini
Collaborator Author

cdrini commented Mar 11, 2020

This is deployed to prod ol-web3; monitoring for issues.

@cdrini
Collaborator Author

cdrini commented Mar 30, 2020

Monitoring is going well; next month will do another re-index + deploy. There are hints that there might be some perf issues, need to add more graphite logging to check. This issue is done though. More issues need to be created for those other things.

@cdrini cdrini closed this as completed Mar 30, 2020
@tfmorris
Contributor

@cdrini Could you describe what "monitoring" means in this context and how the new index was validated to be correct and complete?

I've got to say that I'm finding this whole process quite opaque.

@cdrini cdrini removed the State: Work In Progress This issue is being actively worked on. [managed] label Apr 2, 2020
@cdrini
Collaborator Author

cdrini commented Apr 2, 2020

The correctness of the new index was tested mostly here: #2222 ; and it was connected to 1 of our web nodes for ~3 weeks. The biggest risk of error at this point is mostly performance (which is what led to c702875, and I did notice some more peculiarities in performance even after this, but we'll get more information as it goes).

I closed this issue because a full re-index is running on production; the initial checklist on the issue had a number of issues, but I consider it done once it went to production and ran hooked to prod successfully for weeks. I need to create an issue for the next small steps (which involve removing the "old" solr entirely).
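As one illustration of what a completeness check between the old and new index might look like (not a description of what was actually run for this issue), the sketch below compares per-type document counts and reports types whose counts diverge by more than a relative tolerance. `compare_type_counts` is a hypothetical helper; the "old" counts are the ones reported earlier in this thread, and the "new" counts are invented purely for the example.

```python
def compare_type_counts(old_counts, new_counts, tolerance=0.01):
    """Return {type: (old, new)} for types whose counts differ by more than
    `tolerance`, measured relative to the larger of the two counts."""
    mismatches = {}
    for doc_type in sorted(set(old_counts) | set(new_counts)):
        old = old_counts.get(doc_type, 0)
        new = new_counts.get(doc_type, 0)
        allowed = tolerance * max(old, new)
        if abs(old - new) > allowed:
            mismatches[doc_type] = (old, new)
    return mismatches


# "old" uses the counts from the Feb 16, 2019 comment;
# "new" is made up for illustration only.
old = {"subject": 1510685, "edition": 3419}
new = {"subject": 1512000, "edition": 0}
print(compare_type_counts(old, new))
```

A check like this only catches gross omissions; it says nothing about whether individual documents were indexed correctly.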


5 participants