The Next Gen of Data Crawlers: Java 25 Virtual Threads & Vector DBs

For years, Apache ManifoldCF has been the reliable workhorse of enterprise document ingestion. It handles complex security mappings, diverse repository connectors, and outputs to search engines with incredible depth.

But let’s face it: the Java ecosystem has changed dramatically since ManifoldCF was first designed. Traditional thread pools, XML configuration schemas, and legacy classloading mechanisms are feeling their age.

Meanwhile, the enterprise search landscape has been turned upside down by LLMs, Vector Databases, and RAG (Retrieval-Augmented Generation). That is why I started an experiment: Spring-Manifold Next-Gen.

It’s a next-generation ingestion platform built from the ground up to leverage the best of Java 25, Spring Boot, and AI vector pipelines. And I’m looking for volunteers to help build it.

What Makes it Next-Gen?

Here is a look under the hood of what makes this experiment different:

1. Concurrency Redefined: Java 25 Virtual Threads

In traditional crawlers, scanning massive filesystems or web trees means carefully sizing and tuning thread pools so that blocking I/O does not starve the workers.

With Java 25, we can let the JVM handle it. Spring-Manifold Next-Gen uses virtual threads (a final feature since Java 21) together with the Structured Concurrency API, which is still a preview feature in Java 25. This lets us spawn a cheap virtual thread for every document scan task, so blocking I/O no longer ties up a scarce platform thread and the core concurrency logic becomes much simpler.
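To make the thread-per-task idea concrete, here is a minimal sketch and not the project's actual connector code. The class name VirtualThreadScanSketch and the ScannedDocument record are hypothetical, and the "scan" only reads each file's size. It deliberately sticks to the final Executors.newVirtualThreadPerTaskExecutor() API so it runs without --enable-preview; a Structured Concurrency version would need the preview flag.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Stream;

public class VirtualThreadScanSketch {

    /** Hypothetical result type: a file path plus its size in bytes. */
    public record ScannedDocument(Path path, long sizeBytes) {}

    /** Scans every regular file under root, using one virtual thread per file. */
    public static List<ScannedDocument> scan(Path root)
            throws IOException, InterruptedException, ExecutionException {
        // Collect the file list first so the directory stream is closed promptly.
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).toList();
        }

        List<ScannedDocument> results = new ArrayList<>();
        // The executor starts a new virtual thread for each submitted task.
        // Leaving the try block calls close(), which waits for all tasks.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<ScannedDocument>> futures = new ArrayList<>();
            for (Path file : files) {
                // Stand-in for real parsing work: a blocking filesystem call.
                futures.add(executor.submit(
                        () -> new ScannedDocument(file, Files.size(file))));
            }
            for (Future<ScannedDocument> future : futures) {
                results.add(future.get());
            }
        }
        return results;
    }
}
```

Calling VirtualThreadScanSketch.scan(Path.of("/some/dir")) returns one ScannedDocument per regular file under that directory. There is no pool size to tune: the number of concurrent tasks simply follows the number of files, which is the simplification the section describes.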

2. Built for AI and RAG out-of-the-box

Search is no longer just keyword matching. We built the platform with AI integration at its core:

  • Ollama Integration: Automated local embedding generation using open models (mxbai-embed-large).

  • pgvector: Embeddings are stored natively as vectors in PostgreSQL.

Your documents are parsed, chunked, embedded, and indexed for semantic search in a single stream.
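As a rough illustration of the chunking and storage steps, here is a sketch under stated assumptions rather than the project's real pipeline. ChunkingSketch, chunk, and toVectorLiteral are hypothetical names. The chunker uses fixed-size character windows with overlap, which is only one of many possible strategies, and the embedding call to Ollama is left out entirely. The second method formats an embedding in the bracketed text form that pgvector accepts as input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

public class ChunkingSketch {

    /**
     * Splits text into windows of at most chunkSize characters.
     * Each window starts (chunkSize - overlap) characters after the previous one,
     * so neighbouring chunks share up to overlap characters of context.
     * Note: this counts Java chars, so it can split a surrogate pair.
     */
    public static List<String> chunk(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("require chunkSize > 0 and 0 <= overlap < chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last window already reached the end of the text
            }
        }
        return chunks;
    }

    /** Formats an embedding in pgvector's text input form, for example [0.5,1.0,-2.0]. */
    public static String toVectorLiteral(float[] embedding) {
        StringJoiner joiner = new StringJoiner(",", "[", "]");
        for (float value : embedding) {
            joiner.add(Float.toString(value));
        }
        return joiner.toString();
    }
}
```

One way to use the literal is to bind it as a string parameter and cast it in SQL, for example with ?::vector in an INSERT statement. The table and column it would go into are project-specific and not shown here.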

3. Developer-First Frameworks

We swapped out legacy custom frameworks for:

  • Spring Boot: For clean dependency injection, modern REST APIs, and simple configuration.

  • React + Vite + TailwindCSS: A clean, responsive dashboard to monitor ingestion jobs, replacing server-rendered Velocity templates.

The Current State

Right now, the project is a functional skeleton. We have:

  • The core multi-module Maven structure.

  • A basic Filesystem Repository Connector scanning directory trees using virtual threads.

  • A Vector Output Connector pushing parsed text and embeddings to pgvector.

  • A local development infrastructure via Docker Compose.

  • An initial React dashboard to visualize setup.

But to make this a true successor to enterprise ingestion, we need your help.

Join the Project: Call for Volunteers

This is an open-source experiment, and we are looking for developers, architects, and designers to join the fun. Whether you want to learn Java 25 preview features or apply your frontend skills to open source, there’s a place for you:

🛠️ Backend & Core Engineers

  • Connectors, Connectors, Connectors: We need repository connectors for S3, SharePoint, databases, Google Drive, and more.

  • Security & ACLs: Porting ManifoldCF's legendary security/ACL mapping model to a modern Spring Security context.

  • Job Scheduler: Improving the orchestrator to support cron scheduling and retry policies.

🎨 Frontend UI Developers

  • Admin Dashboard: Grow our React + Vite frontend into a modern dashboard for connector configuration and job monitoring.

🤖 AI / Data Engineers

  • Embedding Pipelines: Optimize text chunking strategies, metadata extraction, and support for cloud embedding APIs (OpenAI, Cohere, HuggingFace).

How to Get Started

Setting up the project takes less than 5 minutes:

  1. Clone the repository:

    git clone https://github.com/OpenPj/spring-manifold-next-gen.git
  2. Start the local infrastructure:

    docker compose up -d
  3. Pull the embedding model:

    docker exec -it ollama ollama pull mxbai-embed-large
  4. Build and run!

Check out our contribution guidelines at CONTRIBUTING.md on our GitHub page:

https://github.com/OpenPj/spring-manifold-next-gen

Let's build the future of open source document ingestion together.

Drop a comment, open an issue, or submit a pull request!
