About a month ago, I moved my second custom Drupal module into production. Generally, I am very, very against writing custom modules. When you are supporting 30+ sites and 3+ multi-site installations, and those numbers continue to grow, it's a very bad idea to write large amounts of custom code that you will then have to support and upgrade forever.
However, this client wanted to be able to search PDF files. The year before, we had made the mistake of offering this service using the Swish-E Indexer module (http://drupal.org/project/swish).
This module had a lot going for it, but there were a couple of things that in the end made it a poor solution. First off, the indexer was not smart enough to only index changed content. All 300+ very large PDFs within the site were being re-indexed EVERY TIME CRON RAN. This meant that it would time out regularly, and it was impacting performance across the server. Sadly, deleted files were only removed from the search results when the indexer was run, so 404 errors started popping up. The final straw was that there was no official Drupal 6 release by the time we needed to upgrade the site.
My preference would have been to drop search indexing of files altogether, but that wasn't an option. So I took a few of the key elements of Swish-E Indexer and wrote a much, much simpler version. It uses the same text extraction method, but instead of custom indexing, I use a computed field (http://drupal.org/project/computed_field) to call the text extraction function on uploaded files (uploaded using the core upload module) and save the resulting text in the computed field. Then I let Drupal's built-in search functionality index the content.
Here is the code I used in the computed field:
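<?php
// Index of the next computed field value; one value is stored per attached file.
$int = 0;
if (!empty($node->files)) {
  foreach ($node->files as $file) {
    // Entries in $node->files may be objects or arrays; normalize to an array.
    if (is_object($file)) {
      $file = (array) $file;
    }
    // Run the text extraction function on the uploaded file.
    $text = pdftotext_do_text_extract($file['filepath']);
    if (!empty($text)) {
      // Escape the extracted text before storing it in the field.
      $node_field[$int]['value'] = check_plain($text);
    }
    else {
      // Placeholder value for files that produced no text.
      $node_field[$int]['value'] = 'The file could not be converted to text.';
    }
    $int++;
  }
}
?>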
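The pdftotext_do_text_extract() function called above lives in the custom module and isn't shown here. As a rough idea of what such a helper could look like, here is a minimal sketch. It assumes the pdftotext command-line utility is installed and on the PATH of a Unix-like server. The function name example_pdf_extract_text() is made up for illustration; this is not the module's actual code.

```php
<?php
/**
 * Hypothetical helper (not the module's real function): extract plain text
 * from a PDF by shelling out to the pdftotext command-line utility.
 *
 * Returns an empty string if the file is missing or nothing was extracted.
 */
function example_pdf_extract_text($filepath) {
  // Bail out early when the file does not exist or cannot be read.
  if (!is_file($filepath) || !is_readable($filepath)) {
    return '';
  }
  // Passing "-" as the output file makes pdftotext write to stdout.
  // escapeshellarg() guards against unsafe characters in the path.
  // Assumption: a Unix-like shell, so errors can be sent to /dev/null.
  $command = 'pdftotext ' . escapeshellarg($filepath) . ' - 2>/dev/null';
  $output = shell_exec($command);
  // shell_exec() does not return a string on failure or empty output.
  return is_string($output) ? trim($output) : '';
}
```

Returning an empty string on failure fits the computed field code above, where an empty result falls through to the "could not be converted" message.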