NAME
File::Extract - Extract Text From Arbitrary File Types
SYNOPSIS
use File::Extract;
my $e = File::Extract->new();
my $r = $e->extract($filename);
my $e = File::Extract->new(encodings => [...]);
my $class = "MyExtractor";
File::Extract->register_processor($class);
my $filter = MyCustomFilter->new;
File::Extract->register_filter($mime_type => $filter);
DESCRIPTION
File::Extract is a framework to extract text data out of arbitrary file
types, useful to collect data for indexing.
CLASS METHODS
register_processor($class)
Registers a new text extractor. The processor is used as the default
processor for the MIME type it reports, but it can be overridden by
specifying the 'processors' parameter to new().
The specified class needs to implement two functions:
mime_type(void)
Returns the MIME type that $class can extract files from.
extract($file)
Extracts the text from $file. Returns a File::Extract::Result
object.
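As an illustration of the two-function interface described above, here
is a minimal sketch of a processor class. It is not taken from the
distribution: the MIME type 'text/x-example' is made up, and My::Result
is a hypothetical stand-in for File::Extract::Result (whose constructor
arguments are not documented here) so that the sketch runs on its own.

```perl
use strict;
use warnings;

# Hypothetical stand-in for File::Extract::Result, used only so this
# sketch runs without the distribution installed. Field names are
# assumptions, not the real class's API.
package My::Result;
sub new       { my ($class, %args) = @_; return bless {%args}, $class }
sub text      { $_[0]{text} }
sub filename  { $_[0]{filename} }
sub mime_type { $_[0]{mime_type} }

package MyExtractor;

sub new { my ($class) = @_; return bless {}, $class }

# The MIME type this processor can extract text from (made-up value).
sub mime_type { 'text/x-example' }

# Read the whole file and wrap its contents in a result object.
sub extract {
    my ($self, $file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $text = do { local $/; <$fh> };   # slurp the entire file
    close $fh;
    return My::Result->new(
        text      => $text,
        filename  => $file,
        mime_type => $self->mime_type,
    );
}

package main;
```

A class shaped like this would then be handed to
File::Extract->register_processor('MyExtractor'), as in the SYNOPSIS.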
register_filter($mime_type, $filter)
Registers a filter to be applied whenever a file of the given MIME
type is detected.
METHODS
new(%args)
magic
Returns the File::MMagic::XS object that is used by this instance.
Use it to set options or otherwise adjust MIME type detection. E.g.:
my $extract = File::Extract->new(...);
$extract->magic->add_file_ext(t => 'text/perl-test');
$extract->extract(...);
filters
A hashref mapping MIME types to arrayrefs of filters. The filters
are applied to a file's contents before text extraction is attempted.
Here's a trivial example that prepends a line number to each line
before the text is extracted.
use File::Extract;
use File::Extract::Filter::Exec;
my $extract = File::Extract->new(
filters => {
'text/plain' => [
File::Extract::Filter::Exec->new(cmd => "perl -pe 's/^/\$. /'")
]
}
);
my $r = $extract->extract($file);
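To make the effect of that filter command concrete, the following
sketch performs the same transformation in-process. The function name
number_lines is hypothetical and is not part of File::Extract; it only
mirrors what perl -pe 's/^/$. /' does to its input.

```perl
use strict;
use warnings;

# Prefix every line of $text with its 1-based line number and a space,
# the same result as piping the text through: perl -pe 's/^/$. /'
sub number_lines {
    my ($text) = @_;
    my $n = 0;
    # split /^/ breaks the string into lines, keeping each newline.
    return join '', map { ++$n . " $_" } split /^/, $text;
}

print number_lines("foo\nbar\n");   # prints "1 foo" and "2 bar"
```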
processors
A list of processors to be used for this instance. This overrides
any processors that were registered previously via the
register_processor() class method.
encodings
List of encodings that you expect your files to be in. This is used
to re-encode and normalize the contents of the file via
Encode::Guess.
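The kind of normalization described here can be sketched with
Encode::Guess directly. This is an assumption about the general
approach, not the module's actual internals; normalize_text and its
argument order are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;   # exports guess_encoding()

# Guess which of the candidate encodings $bytes is in, decode it, and
# re-encode it in $output_encoding. Dies if the guess fails or is
# ambiguous (guess_encoding then returns an error string, not an object).
sub normalize_text {
    my ($bytes, $encodings, $output_encoding) = @_;
    my $enc = guess_encoding($bytes, @$encodings);
    die "Cannot guess encoding: $enc" unless ref $enc;
    my $chars = $enc->decode($bytes);
    return encode($output_encoding, $chars);
}

# Example call (hypothetical input variable):
# my $utf8 = normalize_text($raw_bytes, ['euc-jp'], 'UTF-8');
```

Note that Encode::Guess can report ambiguity when several candidate
encodings all decode the input cleanly, so a short, specific list of
expected encodings tends to work best.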
output_encoding
The encoding that you want the extracted text to be in. The default
encoding is UTF8.
extract($file)
SEE ALSO
File::MMagic::XS
AUTHOR
Copyright 2005 Daisuke Maki <dmaki@cpan.org>. All rights reserved.
Development funded by Brazil, Ltd. <http://b.razil.jp>