Full re-index of solr data on prod #1067

cdrini · 2018-09-05T18:47:48Z

cdrini · 2019-02-16T23:06:20Z

These are the other types on production solr. Need to investigate why they're there/if they should be there. Also need to investigate why there wasn't any stats data there (as I previously thought).

type: subject (1510685)
Sample:

<doc>
  <str name="key">/subjects/org:conseil_national_économique_(france)</str>
  <str name="name">Conseil national économique (France)</str>
  <str name="subject_type">org</str>
  <arr name="text">
    <str>Conseil national économique (France)</str>
    <str>/subjects/org:conseil_national_économique_(france)</str>
  </arr>
  <str name="type">subject</str>
  <int name="work_count">1</int>
</doc>

type: edition (3419)
Sample:

<doc>
  <arr name="author_key">
    <str>OL6941607A</str>
  </arr>
  <arr name="author_name">
    <str>Carlos Arturo Jiménez</str>
  </arr>
  <bool name="has_fulltext">false</bool>
  <str name="key">/books/OL25648663M</str>
  <int name="last_modified_i">1419832732</int>
  <arr name="seed">
    <str>/books/OL25648663M</str>
    <str>/works/OL15935579W</str>
    <str>/subjects/politics_and_government</str>
    <str>/subjects/presidents</str>
    <str>/subjects/frente_sandinista_de_liberación_nacional</str>
    <str>/subjects/assassination_attempts</str>
    <str>/subjects/person:daniel_ortega</str>
    <str>/subjects/person:carlos_arturo_jiménez</str>
    <str>/subjects/place:nicaragua</str>
    <str>/subjects/time:1979-1990</str>
    <str>/authors/OL6941607A</str>
  </arr>
  <arr name="text">
    <str>Nosotros no le decíamos presidente</str>
    <str>Carlos Arturo Jiménez</str>
    <str>/books/OL25648663M</str>
    <str>OL6941607A</str>
  </arr>
  <str name="title">Nosotros no le decíamos presidente</str>
  <str name="title_suggest">Nosotros no le decíamos presidente</str>
  <str name="type">edition</str>
</doc>
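Per-type counts like the ones above can be pulled with a single facet query on the `type` field. The following is a minimal sketch, assuming Solr's standard `facet` request parameters and its default flat JSON facet layout. The base URL is a placeholder (the production endpoint is not given in this thread), and the sample response is hand-written from the two counts quoted above rather than captured output.

```python
import json
from urllib.parse import urlencode

# Placeholder base URL; the real production endpoint is not given in this thread.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def type_facet_url(base_url=SOLR_SELECT_URL):
    """Build a query URL that returns no documents, only per-type counts."""
    params = {
        "q": "*:*",
        "rows": 0,
        "facet": "true",
        "facet.field": "type",
        "wt": "json",
    }
    return base_url + "?" + urlencode(params)


def parse_type_counts(response_text):
    """Turn Solr's flat [value, count, value, count, ...] facet list into a dict."""
    flat = json.loads(response_text)["facet_counts"]["facet_fields"]["type"]
    return dict(zip(flat[0::2], flat[1::2]))


# Hand-written sample response using the counts quoted above (not real output).
SAMPLE_RESPONSE = json.dumps(
    {"facet_counts": {"facet_fields": {"type": ["subject", 1510685, "edition", 3419]}}}
)

print(type_facet_url())
print(parse_type_counts(SAMPLE_RESPONSE))
```

Setting `rows` to 0 keeps the response small, since only the facet counts are of interest here.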

tfmorris · 2019-07-21T20:17:04Z

There is a set of official Docker images for Solr that we may want to consider using:
https://hub.docker.com/_/solr

cdrini · 2019-07-22T15:00:28Z

Unfortunately none of them support our current version of solr :/

tfmorris · 2019-07-22T15:34:17Z

That's because Solr 3.6 is so ancient it hasn't been supported for years. Given that Solr can only read indexes from the immediately preceding major release before requiring a complete reindex, and we're planning a reindex anyway, it seems like the perfect opportunity to upgrade to a more modern (and supported) version.

As far as I know we have a pretty vanilla installation and schema and don't make use of any exotic features which are likely to be version dependent. The current supported Solr releases are 7.7 and 8.1.

cdrini · 2019-07-22T15:42:09Z

> As far as I know we have a pretty vanilla installation...

Are you willing to bet on that assumption, though? :P

Doing them together increases the risk that the reindex will have a bug and be unusable. I want to switch openlibrary to the reindex as soon as possible so that we can resolve a lot of those outdated index issues we've been having. The next step is updating the schema to better support diacritics/etc. After that comes updating the solr version (which would require an audit of everywhere the solr API is used in our code to make sure the APIs in the latest version are still the same).

The full reindex is mostly automated, so takes an ~fixed amount of time. Adding new features will take developer time (which is more valuable) and has more uncertainty about how long it will take to add/guarantee those features.
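To make the diacritics goal concrete, here is a small sketch of the matching behaviour that step is after, using names from the sample documents earlier in the thread. It only illustrates the idea in plain Python. In Solr the folding would be configured in the schema's analyzer chain, which this thread does not spell out, so nothing here should be read as the actual schema change.

```python
import unicodedata


def fold_diacritics(text):
    """Lowercase and strip combining marks so 'Jiménez' and 'jimenez' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


print(fold_diacritics("Carlos Arturo Jiménez"))
print(fold_diacritics("Nosotros no le decíamos presidente"))
```

With this kind of folding applied to both the indexed text and the query, a search for "jimenez" would match the author name above.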

tfmorris · 2019-07-22T16:34:35Z

> As far as I know we have a pretty vanilla installation...
>
> Are you willing to bet on that assumption, though? :P

I'm certainly willing to test the hypothesis. Based on my review of the 5 (!) major version upgrade notes and spot checking the upgrade notes for dozens of point releases in between, I judge the risk to be small. Facets are probably the most volatile API visible feature, but even there I didn't see anything that should impact us. A lot of the things affect clusters, replication, and other features that we don't use.

Another advantage of using a more modern version is that we get to take advantage of 7 years of performance improvements.

> I want to switch openlibrary to the reindex as soon as possible so that we can resolve a lot of those outdated index issues we've been having.

Fixing the search infrastructure is a high priority, but it's valuable to keep the historical perspective in mind. Many of these problems have existed for 5+ years. Another few weeks isn't going to make or break users' perceptions of search quality on OpenLibrary.

> The full reindex is mostly automated, so takes an ~fixed amount of time.

It needs to be fully automated and as lightweight as possible (preferably network independent) with no private side channel information required so that we can iterate on search improvements.

> Adding new features will take developer time

True, but we've already invested the time for the main features that we want. Testing time is also significant and the more iterations we break this into, the greater the testing time required.

BTW, I'm not trying to talk anyone else into testing this. I'm happy to roll it into my testing and performance improvements. It may make sense to defer a decision until we have more supporting (or not) data.

cdrini · 2019-07-22T19:45:17Z

I still think lumping everything together is risky. Right now, we have 2 big changes: a full reindex (with lots of new code), and switching our production env to use docker (lots of room for strange errors). Hooking this up to production is crucial to fully testing this. This is essentially a refactor–we want to maintain the ~same functionality, but with changes to how the code/env works. The more changes we pile on, the harder it will be to know what is causing a bug if a bug appears.

To ~quote Martin Fowler:

How to refactor without doing more harm than good:

Don't add functionality at the same time.

Make sure your code has tests before refactoring. Run the tests frequently so you know quickly if your changes have broken something.

Take short, deliberate steps. Refactoring often involves making many localized changes that result in a larger-scale change. If you keep your steps small, and test after each step, you will avoid prolonged debugging.

So doing this in 3 stages has the benefits of:

Lower risk of bugs since fewer changes
Easier to debug since fewer changes
Gets improvements out faster
User experience improvements increase developer morale
Less likely to get blocked since there is less uncertainty in a smaller set of requirements

Doing this in 1 stage:

Have to perform only 1 instead of 3 full reindices
Larger but later site improvement
Less possible overlap (making changes which are no longer relevant with a different version of solr)

So I'm convinced that 3 stages is better ¯\_(ツ)_/¯

LeadSongDog · 2019-12-09T16:35:14Z

Gall's Law:
A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system.

In other words, baby steps, please

tfmorris · 2020-03-04T21:28:44Z

I reported on the results of my Solr 8.1 experiments many months ago but didn't update this issue, so to close the loop, re:

> I'm happy to roll it into my testing and performance improvements. It may make sense to defer a decision until we have more supporting (or not) data.

#2246 includes all the necessary (very minimal) schema updates to support a modern Solr as well as the multicore changes required since there's no such thing as single core Solr any more. The commits should be easily identifiable from the commit messages, but I'm happy to break them out into a separate branch if that makes things easier.
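For readers unfamiliar with the multicore change, the visible difference on the client side is that request paths include a core name. A minimal sketch, assuming the old setup answered at `/solr/select`; the host and the core name `openlibrary` are made up for illustration and are not taken from #2246.

```python
from urllib.parse import urlencode


def select_url(host, core=None, **params):
    """Build a select URL; with a core name the path becomes /solr/<core>/select."""
    path = "/solr/select" if core is None else f"/solr/{core}/select"
    return f"http://{host}{path}?{urlencode(params)}"


# Old single-core style vs. a named core ("openlibrary" is a made-up name here).
print(select_url("localhost:8983", q="title:presidente"))
print(select_url("localhost:8983", core="openlibrary", q="title:presidente"))
```

Any code that builds Solr URLs by hand would need the core segment added, which is one reason to keep URL construction in a single helper.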

tfmorris · 2020-03-04T21:34:06Z

> So I'm convinced that 3 stages is better ¯\_(ツ)_/¯

This opinion is 9 months old, so hopefully it has changed, but I think a key factor which might be being overlooked is the testing cycle. Even the "minimal" reindex is a complete reboot which will require extensive human testing to confirm that things are working as expected. It's very likely that bug fixes will, themselves, require additional complete rebuilds. Given this, I think it makes sense to bundle a reasonable amount of functionality into these heavyweight rebuilds.

cdrini · 2020-03-11T01:38:28Z

This is deployed to prod ol-web3; monitoring for issues.

cdrini · 2020-03-30T20:14:46Z

Monitoring is going well; next month will do another re-index + deploy. There are hints that there might be some perf issues, need to add more graphite logging to check. This issue is done though. More issues need to be created for those other things.

tfmorris · 2020-03-30T21:05:44Z

@cdrini Could you describe what "monitoring" means in this context and how the new index was validated to be correct and complete?

I've got to say that I'm finding this whole process quite opaque.

cdrini · 2020-04-02T23:41:53Z

The correctness of the new index was tested mostly here: #2222 ; and it was connected to 1 of our web nodes for ~3 weeks. The biggest risk of error at this point is mostly performance (which is what led to c702875, and I did notice some more peculiarities in performance even after this, but we'll get more information as it goes).

I closed this issue because a full re-index is running on production; the initial checklist on the issue had a number of issues, but I consider it done once it went to production and ran hooked to prod successfully for weeks. I need to create an issue for the next small steps (which involve removing the "old" solr entirely).
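As a complement to the checks in #2222 (which this thread does not describe in detail), one cheap completeness check is to compare per-type document counts between the old and new indexes. The sketch below assumes the counts have already been fetched into plain dicts. The function name, the 1% tolerance, and the numbers in the example are all hypothetical.

```python
def compare_type_counts(old_counts, new_counts, tolerance=0.01):
    """Return {type: (old, new)} for types whose counts differ by more than `tolerance`.

    A type missing from one index is treated as a count of 0 there.
    """
    mismatches = {}
    for doc_type in sorted(set(old_counts) | set(new_counts)):
        old = old_counts.get(doc_type, 0)
        new = new_counts.get(doc_type, 0)
        # Relative difference against the larger count; equal counts (including 0/0) pass.
        largest = max(old, new)
        if largest and abs(old - new) / largest > tolerance:
            mismatches[doc_type] = (old, new)
    return mismatches


# Made-up numbers for illustration only.
old = {"work": 1000, "edition": 3419, "subject": 1510685}
new = {"work": 1004, "subject": 0}
print(compare_type_counts(old, new))
```

A check like this would not prove correctness of individual documents, but it would flag a whole type going missing, such as subjects not being indexed.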

cdrini added the Module: Docker Issues related to the configuration or use of Docker. [managed] label Sep 5, 2018

cdrini self-assigned this Sep 5, 2018

tfmorris added the Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] label Dec 27, 2018

cdrini mentioned this issue Jan 21, 2019

Create flow for building a fresh solr instance from a dump file #1843

Merged

cdrini mentioned this issue Feb 17, 2019

Subjects not being indexed into Solr #1896

Open

cdrini added State: Work In Progress This issue is being actively worked on. [managed] Type: Feature labels Apr 21, 2019

brad2014 added Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed] and removed Type: Feature labels Apr 23, 2019

cdrini mentioned this issue May 9, 2019

Make most SOLR fields ignore diacritics #599

Closed

cdrini added the Type: Epic A feature or refactor that is big enough to require subissues. [managed] label Jul 19, 2019

cdrini mentioned this issue Jul 19, 2019

Reindex documents into a new Solr on OJF #2222

Closed

3 tasks

cdrini added the Priority: 2 Important, as time permits. [managed] label Nov 27, 2019

mekarpeles added the Lead: @cdrini Issues overseen by Drini (Staff: Team Lead & Solr, Library Explorer, i18n) [managed] label Dec 18, 2019

cdrini modified the milestone: Next Sprint (Proposed) Feb 3, 2020

cdrini changed the title ~~Switch production solr to use Docker~~ Full re-index of solr data on prod Mar 2, 2020

cdrini added this to the Next Sprint (Proposed) milestone Mar 2, 2020

cdrini modified the milestones: Next Sprint (Proposed), Active Sprint Mar 2, 2020

cdrini mentioned this issue Mar 3, 2020

Fix solr queries not correctly encoding parameters #3117

Merged

cdrini closed this as completed Mar 30, 2020

cdrini removed the State: Work In Progress This issue is being actively worked on. [managed] label Apr 2, 2020

cdrini mentioned this issue Apr 2, 2020

Add LCC and Dewey decimal numbers to solr in April solr reindex #3290

Closed
