The Automated Librarian: Part 3 - Indexing PDF Content with Solr & Tika in Drupal 11

February 11, 2026

I had successfully migrated over 180 eBooks into my Drupal 11 Media Library. On the surface, it looked perfect: clean rows of entities, organized taxonomy terms, and a stable NFS mount. But as I started clicking through, the reality set in.

I had built a useful digital bookshelf, but every book on it was wrapped in duct tape.

eBook Library after importing my data.

I knew I had a chapter on "event-driven architecture" buried somewhere in those thousands of digital pages, but Drupal’s standard database search was blind to it. It could see the filename and my (admittedly thin) metadata, but the actual knowledge was trapped inside binary PDF blobs. To find the gold, I didn't just need a catalog; I needed X-ray vision.

This post is part of a series. Check out the full roadmap at The Automated Librarian: A Drupal 11 Data Discovery.

The Search for a Specialized Librarian

In the world of Drupal, if you want to search inside files, you don't look at the database. You have to look inside the book. To do that, you need a specialized toolset starting with Apache Solr.

Solr is the high-speed "brain" that indexes content in a way that makes deep-text searching instantaneous. But Solr by itself doesn't know how to "read" a PDF or an EPUB. For that, you need Apache Tika—the "universal translator" of the digital world. Tika cracks open the files, extracts the raw text and hidden metadata, and hands it to Solr on a silver platter.
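Before wiring anything into Drupal, it is worth seeing what Tika actually "reads" out of a file. Here is a minimal sketch using the standalone tika-app jar; the jar version and file paths are placeholders for whatever you have on hand:

  # Extract the plain text Tika can pull out of a PDF or EPUB
  java -jar tika-app-2.9.2.jar --text /path/to/some-ebook.pdf > extracted.txt

  # Dump the embedded metadata (title, author, page count, etc.)
  java -jar tika-app-2.9.2.jar --metadata /path/to/some-ebook.pdf

If the text comes back empty, the file is probably a scanned image, and no amount of Solr tuning will fix that; it would need OCR first.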

I’ll be honest: I am a Drupal developer, not a Solr administrator. Configuring Solr cores and Tika extractors can quickly descend into a rabbit hole of XML files and Java heap errors, so I won’t go into much detail on that topic.

When I hit a wall with the Solr server configuration, I turned to a resource that has saved me more times than I can count: Ivan Zugec over at WebWash. Specifically, I followed his guide, How to Index and Search PDF Files in Drupal, which saved me hours of trial and error.
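On the Drupal side, the pieces I lean on throughout this post are Search API, Search API Solr, and Search API Attachments. A rough sketch of getting them in place (the core name "ebooks" is just a placeholder, and the core still needs the config set generated for your server, which Ivan's guide walks through):

  # Pull in the contrib modules (Search API comes along as a dependency)
  composer require drupal/search_api_solr drupal/search_api_attachments

  # Enable them
  drush en search_api_solr search_api_attachments -y

  # On the Solr server: create a core for the library
  bin/solr create -c ebooks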

Integrating the Brain: The Configuration

With the server humming, I had to tell Drupal how to talk to it. This happens through the Search API module. Think of the Search API as the translator: it takes your Drupal Media entities, strips them down to the data Solr needs, and pushes them across the wire.
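For reference, the connection itself ends up in a search_api.server.*.yml config export. A typical export looks roughly like the sketch below; the IDs, host, and core name are placeholders, and exact keys can vary between search_api_solr versions:

  id: ebook_solr
  name: 'eBook Solr Server'
  backend: search_api_solr
  backend_config:
    connector: standard
    connector_config:
      scheme: http
      host: localhost
      port: 8983
      path: /
      core: ebooks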

1. Mapping the Fields

I didn't just want to search the "Body" of the text. To make the library truly searchable, I indexed a mix of metadata and extracted content.

Field Name     | Type     | Purpose
Name (Title)   | Fulltext | The primary identity of the book.
Extracted Text | Fulltext | The actual "meat" inside the PDF/EPUB, pulled by Tika.

2. The Art of the "Boost"

In search, not all matches are created equal. If I search for "Docker," a book with "Docker" in the Title is far more relevant than a book that simply mentions "Docker" once on page 300. I applied a Weighted Boost to ensure the most relevant results bubbled to the top:

  • Name (Title): Boost: 8.0 — This is the "Heavy Hitter." If the keyword is in the title, it’s almost certainly what the user wants.
  • Extracted Text: Boost: 5.0 — The baseline. We want to find the text, but it shouldn't outweigh a title match.

Pro Tip: Only fields set to the Fulltext data type can be boosted. If your field is set to "String," Solr treats it as a single literal value — great for filtering (Facets) but useless for keyword relevancy.
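In config-export terms, those choices land in the index's field_settings. The snippet below is illustrative rather than a literal export: the machine name and property path of the extracted-text field depend on how Search API Attachments registers it, and the ISBN field is only here to show the String contrast.

  field_settings:
    name:
      label: 'Name'
      datasource_id: 'entity:media'
      property_path: name
      type: text
      boost: 8.0
    extracted_text:
      label: 'Extracted Text'
      datasource_id: 'entity:media'
      property_path: extracted_text   # placeholder; supplied by Search API Attachments
      type: text
      boost: 5.0
    isbn:
      label: 'ISBN'
      datasource_id: 'entity:media'
      property_path: field_isbn
      type: string                    # filterable, but no keyword relevancy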

3. The Extraction Pipeline

This is where the Search API Attachments module comes in. I configured it to use the "Solr Extractor" method. Instead of my web server trying to parse a 50MB PDF, Drupal simply tells Solr: "Here is a file; you have Tika over there, you figure it out." Solr does the heavy lifting, extracts the text, and stores it in the extra_query field. This kept the migration fast and my server's CPU usage low.
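You can sanity-check this hand-off by hitting Solr's extracting handler directly before asking Drupal to do it at scale. A quick sketch, assuming a core named "ebooks" and that the extraction handler and its libraries are available on the Solr side (Ivan's guide covers that part):

  # Ask Solr/Tika to extract text from a file without indexing it
  curl 'http://localhost:8983/solr/ebooks/update/extract?extractOnly=true&extractFormat=text' \
    -F 'file=@/path/to/some-ebook.pdf'

If that call returns a blob of readable text, the pipeline is ready and Drupal only needs to point files at it.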

Indexing Is No Good Without A Window

Building the engine was only half the battle. Now, I needed a dashboard — a way to actually interact with all that indexed data. This is where Drupal Views and the 'Fulltext search' filter come into play.

To do this, I created a new View and, in the view settings, selected “Index ebook index” as the data source so Drupal queries the Solr index rather than the site database.

The Fields are largely a matter of preference; add whatever you like to your display. The key here is the Filter Criteria: add the Fulltext search field and check the box labeled “Expose this filter to visitors, to allow them to change it”. Once you do, you will have several options for the fulltext search.

The Parse mode is where the magic happens. The default is “Single phrase”, which may be fine for a general search, but I didn’t just want a search box; I wanted a command line. By switching the Parse mode to “Direct query”, I gained the ability to use operators like “Drupal Solr -Docker” to filter out the noise on the fly. (The other parse modes return different result sets; play with each one to see which fits your specific needs and expectations.)
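For reference, here are the kinds of queries that start working with Direct query. They are passed more or less straight through to Solr, so this is standard Lucene-style syntax, and the exact behavior depends on your schema and default operator:

  • "service container" matches the exact phrase.
  • Drupal Solr -Docker keeps the Drupal and Solr hits but drops anything mentioning Docker.
  • +Drupal +Solr requires both terms.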

Screenshot: Drupal 11 Search API View configuration showing the Fulltext search filter and Direct query parse mode settings.

The "Aha!" Moment

I triggered the indexer and watched the progress bar crawl. It wasn't just checking boxes; it was digesting years of technical manuals.
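(If you prefer the command line to the batch UI, the Search API Drush commands do the same job; the machine name "ebook_index" below is a placeholder for whatever your index is called.)

  # Index everything still queued for this index
  drush search-api:index ebook_index

  # See how many items are indexed versus still pending
  drush search-api:status ebook_index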

Once finished, I went to my search view and typed "service container." In milliseconds, the system didn't just return a list of books with that title. It gave me excerpts from Pages 7 and 17 of a Drupal 10 Module Development Guide and Page 37 of Symfony: The Book.

Screenshot: Drupal 11 search results page showing highlighted text excerpts from indexed PDF files using Solr and Tika.

The "Black Box" was officially open!

What’s Next?

We have the files, and we finally have the ability to search deep within them. But there’s still a glaring issue. As I mentioned at the end of Part 2, my internal metadata—the authors, publishers, and ISBNs—is often missing or incorrect.

Unlocking the text was the hard part, but our library catalog is still a mess of missing authors, missing descriptions, and mystery ISBNs. In the next installment, we’re going to stop manual entry and start automating. I'll be connecting to AI LLMs and an external API to fetch rich metadata, turning our simple search into a truly intelligent discovery engine. Stay tuned!

Author

Ron Ferguson

 
