The Automated Librarian: Part 2 – The Migration Engine

February 03, 2026

As I stated in my previous post, I have been collecting eBooks on various topics for several years. Some of the information is old, but it still holds immense value. The problem is simple: I know I have the information, I’m just not sure where it lies within thousands of digital pages. I needed a way to find what I was looking for without spending hours combing through manuscripts.

I knew from working with Drupal on a daily basis that it, with the addition of some other external systems, had the capability to do exactly what I wanted. But first, I had to get those files into the system. Manually uploading 180+ PDFs? Not a chance. I don't have the time or the patience for that.

This post is part of a series. Check out the full roadmap at The Automated Librarian: A Drupal 11 Data Discovery.

The Strategy: Why Migrations?

My eBooks are stored on a NAS in a folder named "ebooks." The challenge was that the NAS is a separate machine from my Drupal site: I needed to get that folder into Drupal's public:// file scheme.

I typically use Samba for cross-platform access, but symlinking Samba shares can be a nightmare for Drupal file permissions. Instead, I opted for NFS and network mounts. By creating a mount point at /mnt/media/ebooks in my fstab, the library is available upon reboot. I then symlinked that mount point into the Drupal directory structure, giving me a clean public://library_source path without resorting to the nuclear option of 777 permissions.
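A minimal sketch of that setup, assuming a NAS hostname of nas.local and a typical Composer-based docroot; your share path and web root will differ:

```shell
# /etc/fstab entry: mount the NAS export read-only at boot.
#   nas.local:/volume1/ebooks  /mnt/media/ebooks  nfs  ro,defaults  0  0
sudo mkdir -p /mnt/media/ebooks
sudo mount /mnt/media/ebooks

# Symlink the mount into the public files directory so Drupal
# can address it as public://library_source.
ln -s /mnt/media/ebooks /var/www/html/web/sites/default/files/library_source
```

Mounting read-only is a deliberate safety net: the migration only ever needs to read the source files.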

Integration Options

Option        | Pros                   | Cons                        | Verdict
------------- | ---------------------- | --------------------------- | -----------------------
Manual Upload | Direct control         | Soul-crushing / slow        | Hard pass
Feeds Module  | Easy UI                | Requires pre-processed CSV  | Too much prep
Migrate API   | High power / rollbacks | Custom code required        | Best long-term solution

I previously used the Migrate API while moving a massive site from Drupal 7 to 8 and knew how powerful it could be. It was the only way to build a repeatable pipeline.

Laying the Foundation: Media Entities

Are eBooks "Content" (Nodes) or "Media"? I settled on a new Media Entity bundle called "eBook." These files are assets with specific metadata common to literature, manuals, and guides alike.

I kept the field list narrow to keep the code focused. While it's tempting to capture every possible data point, I focused on the "Must-Haves":

  • Title & Subtitle
  • Author(s) & Publisher: Using Taxonomy terms for relational consistency
  • Publish Year & Page Count
  • Security Flag: To identify password-protected files
  • ISBN: The unique 10- or 13-digit identifier assigned to books for tracking worldwide

Note: These will be the keys that unlock our Search and API steps in Parts 3 and 4!

The Ingestion Layer: The directory Source

I used the excellent Migrate Source Directory module. It allows you to use a directory of files as the source, handling the recursion and file scanning for you.

Setting up the migration file was straightforward:

id: ebook_import
label: 'Import eBooks from NFS'
source:
  plugin: directory
  directory: 'public://library_source'
  recurse: true
  file_mask: '/\.(pdf|epub)$/i'
  track_changes: true

The Brain: The Custom Process Plugin

With the files flowing in, the remaining fields are populated by a custom process plugin: EbookMetadata. It leans heavily on the Kiwilan\Ebook library to extract each file's internal metadata.
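A sketch of what such a plugin can look like, assuming a custom module named librarian; the Kiwilan getter names (getTitle(), getAuthors()) follow that library's documented API, but verify them against the version you install, and note the real plugin also handles the 'document' property (file entity creation), which is elided here:

```php
<?php

namespace Drupal\librarian\Plugin\migrate\process;

use Drupal\migrate\MigrateExecutableInterface;
use Drupal\migrate\ProcessPluginBase;
use Drupal\migrate\Row;
use Kiwilan\Ebook\Ebook;

/**
 * Extracts a single metadata property from an eBook file.
 *
 * @MigrateProcessPlugin(
 *   id = "ebook_metadata"
 * )
 */
class EbookMetadata extends ProcessPluginBase {

  /**
   * {@inheritdoc}
   */
  public function transform($value, MigrateExecutableInterface $migrate_executable, Row $row, $destination_property) {
    // Resolve the public:// URI to a real filesystem path.
    $path = \Drupal::service('file_system')->realpath($value);
    if (!$path) {
      return NULL;
    }

    // Parse the file's embedded metadata (EPUB OPF, PDF info, etc.).
    $ebook = Ebook::read($path);

    // The 'property' key from the migration YAML selects what to return.
    return match ($this->configuration['property'] ?? 'title') {
      'title' => $ebook->getTitle(),
      'author' => ($ebook->getAuthors()[0] ?? NULL)?->getName(),
      default => NULL,
    };
  }

}
```

Returning NULL for unreadable files lets the skip_on_empty step in the pipeline do its job instead of halting the whole run.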

The Migration Configuration

process:
  bundle:
    plugin: default_value
    default_value: ebook

  field_media_document:
    - plugin: ebook_metadata
      source: source_file_pathname
      property: document
    - plugin: skip_on_empty
      method: process

  name:
    - plugin: ebook_metadata
      source: source_file_pathname
      property: title

  field_author:
    - plugin: ebook_metadata
      source: source_file_pathname
      property: author
    - plugin: entity_generate
      entity_type: taxonomy_term
      value_key: name
      bundle: authors
# ... [Additional mappings for Publisher, Tags, etc.]

destination:
  plugin: 'entity:media'

Results: From Folders to Entities

Once the kinks were worked out, I ran the full migration. It took about 20 minutes to process the library. When it finished, I spot-checked the results.
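Assuming the Migrate Tools module is installed, the run (and any re-runs) are driven from Drush:

```shell
# Check the migration's status, then execute it.
drush migrate:status ebook_import
drush migrate:import ebook_import

# Because track_changes is enabled, re-running with --update
# picks up new or modified files on the NAS.
drush migrate:import ebook_import --update

# And since this is the Migrate API, mistakes are reversible.
drush migrate:rollback ebook_import
```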

eBook library screenshot.

The final result of the initial migration: A fully populated Drupal 11 Media Library, organized and ready for the next level of intelligence.

The good news? 20 years of files suddenly appeared in my Drupal Media Library. The bad news? Internal metadata is often "thin," missing, or flat-out wrong. I spent five hours cleaning up the library manually and realized: there has to be a better way. I don't just need the files; I need to unlock the knowledge inside those pages.

In the next part, we'll dive into the search engine's "X-ray machine": Apache Solr and Apache Tika, which transform these binary files into a full-text searchable discovery engine.

Author

Ron Ferguson
