As I stated in my previous post, I have been collecting eBooks on various topics for several years. Some of the information is old, but it still holds immense value. The problem is simple: I know I have the information, I’m just not sure where it lies within thousands of digital pages. I needed a way to find what I was looking for without spending hours combing through manuscripts.
I knew from working with Drupal on a daily basis that, with the help of a few external systems, it could do exactly what I wanted. But first, I had to get those files into the system. Manually uploading 180+ PDFs? Not a chance. I don't have the time or the patience for that.
This post is part of a series. Check out the full roadmap at The Automated Librarian: A Drupal 11 Data Discovery.
The Strategy: Why Migrations?
My eBooks are stored on a NAS server in a folder named "ebooks." The challenge was that the NAS is a separate machine from my Drupal site. I needed to get that folder into Drupal's public:// file scheme.
I typically use Samba for cross-platform access, but symlinking Samba shares can be a nightmare for Drupal file permissions. Instead, I opted for NFS and network mounts. Adding a mount point at /mnt/media/ebooks to my fstab makes the library available on every reboot. I then symlinked that mount point into the Drupal directory structure, giving me a clean public://library_source path without resorting to the nuclear option of 777 permissions.
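For reference, here's a sketch of that setup. Only the /mnt/media/ebooks mount point and the library_source name come from my actual configuration; the NAS hostname, export path, and Drupal docroot below are placeholders:

```bash
# /etc/fstab entry: mount the NAS export at boot.
# (nas.local and /volume1/ebooks are hypothetical.)
#   nas.local:/volume1/ebooks  /mnt/media/ebooks  nfs  defaults,_netdev  0  0

# Symlink the mount into Drupal's public files directory
# (the docroot path is hypothetical).
sudo ln -s /mnt/media/ebooks /var/www/drupal/web/sites/default/files/library_source
```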
Integration Options
| Option | Pros | Cons | Verdict |
|---|---|---|---|
| Manual Upload | Direct Control | Soul-crushing / Slow | Hard Pass |
| Feeds Module | Easy UI | Requires pre-processed CSV | Too much prep |
| Migrate API | High Power / Rollbacks | Custom code required | Best long-term solution |
I previously used the Migrate API while moving a massive site from Drupal 7 to 8 and knew how powerful it could be. It was the only way to build a repeatable pipeline.
Laying the Foundation: Media Entities
Are eBooks "Content" (Nodes) or "Media"? I settled on a new Media Entity bundle called "eBook." These files are assets with specific metadata common to literature, manuals, and guides alike.
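For the curious, the bundle itself reduces to a small config entity. The export below is a trimmed, hand-written sketch (the description text is invented), but the file source plugin and field_media_document source field are how Drupal core handles document media:

```yaml
# Sketch of media.type.ebook.yml, trimmed to the essentials.
id: ebook
label: eBook
description: 'A PDF or EPUB asset with bibliographic metadata.'
source: file
source_configuration:
  source_field: field_media_document
```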
I kept the field list narrow to keep the code focused. While it's tempting to capture every possible data point, I focused on the "Must-Haves":
- Title & Subtitle
- Author(s) & Publisher: Using Taxonomy terms for relational consistency
- Publish Year & Page Count
- Security Flag: To identify password-protected files.
- The ISBN: the unique 10- or 13-digit identifier assigned to books for worldwide tracking.
Note: These will be the keys that unlock our Search and API steps in Parts 3 and 4!
The Ingestion Layer: The directory Source
I used the excellent Migrate Source Directory module. It allows you to use a directory of files as the source, handling the recursion and file scanning for you.
Setting up the migration file was straightforward:
id: ebook_import
label: 'Import eBooks from NFS'
source:
  plugin: directory
  directory: 'public://library_source'
  recurse: true
  file_mask: '/\.(pdf|epub)$/i'
  track_changes: true
The Brain: The Custom Process Plugin
With the source in place, the remaining fields are populated by a custom process plugin: EbookMetadata. It leans heavily on the Kiwilan\Ebook library to extract the internal metadata from each file.
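Here's a minimal sketch of what that plugin looks like. It is not my full implementation: it only handles the title and author properties, the ebook_import module namespace is a placeholder, and the Kiwilan\Ebook accessors (getTitle(), getAuthorMain()) are assumptions about that library's API; check its docs for the exact method names.

```php
<?php

namespace Drupal\ebook_import\Plugin\migrate\process;

use Drupal\migrate\MigrateExecutableInterface;
use Drupal\migrate\ProcessPluginBase;
use Drupal\migrate\Row;
use Kiwilan\Ebook\Ebook;

/**
 * Extracts a single metadata property from an eBook file.
 *
 * @MigrateProcessPlugin(
 *   id = "ebook_metadata"
 * )
 */
class EbookMetadata extends ProcessPluginBase {

  /**
   * {@inheritdoc}
   */
  public function transform($value, MigrateExecutableInterface $migrate_executable, Row $row, $destination_property) {
    // $value is the file path handed over by the directory source.
    $ebook = Ebook::read($value);
    if ($ebook === NULL) {
      // Unreadable or unsupported file; skip_on_empty handles the NULL.
      return NULL;
    }

    // The 'property' key from the migration YAML selects which piece of
    // metadata to return. Accessor names are assumptions about the
    // Kiwilan\Ebook API.
    return match ($this->configuration['property'] ?? NULL) {
      'title' => $ebook->getTitle(),
      'author' => $ebook->getAuthorMain()?->getName(),
      default => NULL,
    };
  }

}
```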
The Migration Configuration
process:
  bundle:
    plugin: default_value
    default_value: ebook
  field_media_document:
    - plugin: ebook_metadata
      source: source_file_pathname
      property: document
    - plugin: skip_on_empty
      method: process
  name:
    - plugin: ebook_metadata
      source: source_file_pathname
      property: title
  field_author:
    - plugin: ebook_metadata
      source: source_file_pathname
      property: author
    - plugin: entity_generate
      entity_type: taxonomy_term
      value_key: name
      bundle: authors
  # ... [Additional mappings for Publisher, Tags, etc.]
Results: From Folders to Entities
Once the kinks were worked out, I ran the full migration. It took about 20 minutes to process the library. When it finished, I spot-checked the results.
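My run-and-iterate loop lived on the command line. A quick sketch, assuming the Migrate Tools contrib module is installed (it provides the migrate:* drush commands):

```bash
# Check status, run the import, and roll it back if the results look wrong.
drush migrate:status ebook_import
drush migrate:import ebook_import
drush migrate:rollback ebook_import
```

That rollback support is what earned the Migrate API its "Rollbacks" entry in the table above: fix the process plugin, roll back, re-import.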
The good news? 20 years of files suddenly appeared in my Drupal Media Library. The bad news? The internal metadata was often "thin," missing, or flat-out wrong. I spent five hours cleaning up the library manually before I realized: there has to be a better way. I don't just need the files; I need to unlock the knowledge inside those pages.
In the next part, we'll dive into the search engine's "X-ray machine," Apache Solr and Apache Tika, to transform these binary files into a full-text searchable discovery engine.