docs(search): add note about re-indexing when enabling Tika #2285

Draft
michaelstingl wants to merge 1 commit into opencloud-eu:main from michaelstingl:docs/search-reindex-tika-note

Contributor

michaelstingl commented Feb 5, 2026

Description

Add notes to the search service README clarifying that:

  1. Enabling Tika does not automatically re-extract content from files that are already indexed
  2. The `opencloud search index --all-spaces` command skips files whose modification time is unchanged
  3. Workaround: delete the Bleve search index before re-indexing to force full content extraction

Related Issue

Motivation and Context

When users enable Tika on an existing instance, they expect full-text search to work for all files. However, `opencloud search index --all-spaces` skips files that are already in the index (mtime-based check in `services/search/pkg/search/service.go`), so the Tika extractor is never called for previously indexed files. This behavior is undocumented and confusing.

How Has This Been Tested?

  • test environment: Read the source code in `services/search/pkg/search/service.go` (`IndexSpace` method, mtime skip logic at line ~495)
  • test case 1: Verified the skip behavior by tracing the code path: `IndexSpace` → `Walk` → mtime check → skip if already indexed
  • test case 2: Confirmed no `--force` flag exists in the CLI or protobuf definition (`IndexSpaceRequest`)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Technical debt
  • Tests only (no source changes)

Checklist:

  • Code changes
  • Unit tests added
  • Acceptance tests added
  • Documentation added

sonarqubecloud bot commented Feb 5, 2026

```shell
opencloud search index --all-spaces
```

> **Note:** The re-index command skips files whose modification time has not changed since they were last indexed. If you changed the extractor type (e.g., from `basic` to `tika`), you need to delete the existing search index first to force a full content re-extraction:
Member

@ScharfViktor Is that true? I think we need to verify this first.

Contributor

Yes, that is true. I can reproduce it.

@michaelstingl

This comment was marked as outdated.

> **Note:** The re-index command skips files whose modification time has not changed since they were last indexed. If you changed the extractor type (e.g., from `basic` to `tika`), you need to delete the existing search index first to force a full content re-extraction:
>
> ```shell
> rm -rf $OC_BASE_DATA_PATH/search # default: /var/lib/opencloud/search
> ```
Contributor

@micbar Maybe this is a bug? I would expect a re-index to work without deleting /search.

Member

@aduffeck Can you clarify? You know the implementation.

Member

That's the current behavior, yes. I consider that a bug.

A re-index should unconditionally rebuild the index for the given space or for all spaces, in my opinion. Maybe it would be helpful to also have a command or flag for just "syncing" the index, i.e. picking up changes that haven't been indexed yet (the current behavior), but that shouldn't be the default behavior of the index command.

@michaelstingl michaelstingl marked this pull request as draft February 6, 2026 12:28

Development

Successfully merging this pull request may close these issues.

How to verify tika is indexing files

4 participants