[PR #2079] [MERGED] New Script: Apache Tika #3407

Closed
opened 2025-11-20 06:04:38 -05:00 by saavagebueno · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/community-scripts/ProxmoxVE/pull/2079
Author: @andygrunwald
Created: 2/6/2025
Status: Merged
Merged: 2/6/2025
Merged by: @tremor021

Base: main ← Head: new-script-apache-tika


📝 Commits (10+)

  • db98a94 New Script: Apache Tika
  • f1c5d87 Temp: Replace github URLs to my own fork
  • 8220cf2 Add additional dependencies according to the Docker image installation
  • 9a3e327 Apache Tika: Set correct tags
  • 29c1b47 Apache Tika: Set TODO to make it updateable
  • c231796 Apache Tika: Fix "software-properties-common: command not found"
  • eb303ee Apache Tika: Automate version detection
  • f3bc197 Apache Tika: Add update_script
  • d107f3d Apache Tika: Added clean up of /opt/apache-tika/tika-server-standard-prev-version.jar after upgrade
  • 973974a Apache Tika: Bump up ram to 2048

📊 Changes

3 files changed (+181 additions, -0 deletions)


ct/apache-tika.sh (+69 -0)
install/apache-tika-install.sh (+78 -0)
json/apache-tika.json (+34 -0)

📄 Description

✍️ Description

Introducing Apache Tika (https://tika.apache.org/) as a new LXC container.

Paperless-ngx can use Apache Tika to parse and extract text from Office files (OpenOffice, MS Office, etc.).

Related Discussion: https://github.com/community-scripts/ProxmoxVE/discussions/499 and https://github.com/community-scripts/ProxmoxVE/discussions/1112



Prerequisites

  • [X] Self-review performed (I have reviewed my code to ensure it follows established patterns and conventions.)
  • [] Testing performed (I have thoroughly tested my changes and verified expected functionality.)

🛠️ Type of Change

  • [] Bug fix (non-breaking change that resolves an issue)
  • [] New feature (non-breaking change that adds functionality)
  • [] Breaking change (fix or feature that would cause existing functionality to change unexpectedly)
  • [X] New script (a fully functional and thoroughly tested script or set of scripts)

📋 Additional Information (optional)

Version detection

Right now, the Tika version is hardcoded because there is no automatic detection yet.
My question:

Is this fine to get started, or is automatic detection required? If the latter, can you provide some hints on your preferred implementation to avoid ping-pong?
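
One possible approach to the detection question, as a minimal sketch rather than what the merged script does: parse the version directories out of an Apache download listing and pick the highest one. The helper name `latest_version` and the listing URL in the usage comment are assumptions, as is the `href="x.y.z/"` shape of the listing.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads an HTML directory listing on stdin and prints
# the highest x.y.z version found in links of the form href="x.y.z/".
latest_version() {
  grep -oE 'href="[0-9]+\.[0-9]+\.[0-9]+/"' \
    | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' \
    | sort -V \
    | tail -n 1
}

# Usage (needs network; the URL is an assumption, not taken from the PR):
# RELEASE=$(curl -fsSL https://dlcdn.apache.org/tika/ | latest_version)
```

`sort -V` compares version components numerically, so 2.10.0 sorts after 2.9.2, which a plain lexical sort would get wrong.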

The update functionality depends on version detection. I can provide it once version detection works.
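
To illustrate how an update step could build on a detected version, here is a hedged sketch. The function names `needs_update` and `swap_jar` are hypothetical, and the current-jar file name `tika-server-standard.jar` is an assumption; only the `-prev-version.jar` name comes from the commit list of this PR.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds only if $2 (latest) is newer than $1 (installed).
needs_update() {
  local installed="$1" latest="$2"
  [ "$installed" != "$latest" ] || return 1
  [ "$(printf '%s\n%s\n' "$installed" "$latest" | sort -V | tail -n 1)" = "$latest" ]
}

# Hypothetical helper: installs the jar $2 into directory $1, keeping the old
# jar as a fallback until the copy succeeds, then cleaning it up.
swap_jar() {
  local dir="$1" new_jar="$2"
  local current="$dir/tika-server-standard.jar"              # assumed name
  local previous="$dir/tika-server-standard-prev-version.jar"
  if [ -f "$current" ]; then
    mv "$current" "$previous"
  fi
  if cp "$new_jar" "$current"; then
    rm -f "$previous"            # remove the backup after a successful upgrade
  else
    if [ -f "$previous" ]; then
      mv "$previous" "$current"  # roll back to the old jar
    fi
    return 1
  fi
}
```

A caller would typically stop the service, run `swap_jar` only when `needs_update` succeeds, and start the service again afterwards.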

Testing

The LXC Container setup is working as expected already.

Before merging, I would like to extend my testing by integrating this setup into Paperless to verify that everything is complete.

Once done, I will revert f1c5d87206.
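
For the Paperless integration test, the following sketch prints the settings that would point Paperless-ngx at the Tika container. `PAPERLESS_TIKA_ENABLED` and `PAPERLESS_TIKA_ENDPOINT` are documented Paperless-ngx settings and 9998 is Tika server's default port; the helper name `paperless_tika_env` and the example address are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: prints Paperless-ngx settings for a Tika server
# reachable at host $1 and port $2 (defaults to 9998, Tika's default port).
paperless_tika_env() {
  local tika_host="$1" tika_port="${2:-9998}"
  printf 'PAPERLESS_TIKA_ENABLED=1\n'
  printf 'PAPERLESS_TIKA_ENDPOINT=http://%s:%s\n' "$tika_host" "$tika_port"
}

# Usage (192.0.2.10 is a placeholder address):
# paperless_tika_env 192.0.2.10 >> paperless.conf
# Quick reachability check against the running container (needs network):
# curl -fsS http://192.0.2.10:9998/version
```

Paperless-ngx also expects a Gotenberg endpoint (`PAPERLESS_TIKA_GOTENBERG_ENDPOINT`) for Office documents, which is outside the scope of this script.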

Resource usage

The configured resources are a guess right now.
I have no experience to judge whether they are over- or undersized.
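
One way to replace the guess with a measurement, as a rough sketch: read the peak resident memory of the Tika process from its `/proc/<pid>/status` file after processing a few representative documents. The helper name `peak_rss_mib` is hypothetical; `VmHWM` is the kernel's peak resident set size field, reported in kB.

```shell
#!/usr/bin/env bash

# Hypothetical helper: prints the peak resident set size in MiB, read from
# the VmHWM line of a /proc/<pid>/status file given as $1.
peak_rss_mib() {
  awk '/^VmHWM:/ { printf "%d\n", $2 / 1024 }' "$1"
}

# Usage inside the container (assumes the process command line contains
# "tika-server"):
# peak_rss_mib "/proc/$(pgrep -f tika-server | head -n 1)/status"
```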

Configuration

Apache Tika can be configured.
This script does not ship a default config file, because Apache Tika comes with sane defaults that should suit the majority of users.
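
For users who do want a custom configuration, a sketch of how the start command could change. To my knowledge Tika server accepts a `--config` option pointing at a `tika-config.xml`; verify with `--help` on the installed version. The helper name `tika_cmd` and both paths in the usage comment are assumptions.

```shell
#!/usr/bin/env bash

# Hypothetical helper: prints the command line to start Tika server from the
# jar $1, adding --config only when a config file path is given as $2.
tika_cmd() {
  local jar="$1" config="${2:-}"
  if [ -n "$config" ]; then
    printf 'java -jar %s --config %s\n' "$jar" "$config"
  else
    printf 'java -jar %s\n' "$jar"
  fi
}

# Usage (both paths are placeholders):
# tika_cmd /opt/apache-tika/tika-server-standard.jar /opt/apache-tika/tika-config.xml
```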


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

saavagebueno added the pull-request label 2025-11-20 06:04:38 -05:00
Reference: SVI/ProxmoxVE#3407