Lucy::Analysis::Token - Unit of text.
    my $token = Lucy::Analysis::Token->new(
        text         => 'blind',
        start_offset => 8,
        end_offset   => 13,
    );

    $token->set_text('mice');
Token is the fundamental unit used by Apache Lucy’s Analyzer subclasses.
Each Token has 5 attributes: text, start_offset, end_offset, boost, and pos_inc.
The text attribute is a Unicode string encoded as UTF-8.
start_offset is the start point of the token text, measured in Unicode code points from the top of the stored field; end_offset delimits the corresponding closing boundary.
start_offset and end_offset locate the Token within a larger context, even if the Token’s text attribute gets modified – by stemming, for instance. The Token for “beating” in the text “beating a dead horse” begins life with a start_offset of 0 and an end_offset of 7; after stemming, the text is “beat”, but the start_offset is still 0 and the end_offset is still 7. This allows “beating” to be highlighted correctly after a search matches “beat”.
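The stemming scenario can be sketched with the constructor and accessors documented on this page. This is a minimal illustration, not how Lucy's own stemmer is implemented: the call to set_text stands in for a real stemmer, and $source is a hypothetical copy of the field text.

```perl
use strict;
use warnings;
use Lucy::Analysis::Token;

# Hypothetical field text; offsets are counted in code points from its start.
my $source = 'beating a dead horse';

# The Token as it might look straight out of a tokenizer.
my $token = Lucy::Analysis::Token->new(
    text         => 'beating',
    start_offset => 0,
    end_offset   => 7,
);

# Stand-in for a stemmer: only the text attribute is changed.
$token->set_text('beat');

# The offsets still describe where "beating" sits in $source,
# so the original word can be recovered for highlighting.
my $start = $token->get_start_offset;
my $end   = $token->get_end_offset;
my $span  = substr( $source, $start, $end - $start );

printf "%s -> %s [%d, %d)\n", $span, $token->get_text, $start, $end;
```

Because the offsets are untouched by set_text, a highlighter can mark up the original word even though the index stores the stemmed form.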
boost is a per-token weight. Use it to assign more or less importance to a particular token, as you might for emboldened text within an HTML document. (Note: The field this token belongs to must be specified to use a posting of type RichPosting.)
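As a small sketch of setting a per-token weight, the constructor's optional boost parameter can be supplied when the Token is created. The value 2.0 and the variable $bold_boost are assumptions chosen for illustration; the RichPosting field configuration mentioned in the note is not shown here.

```perl
use strict;
use warnings;
use Lucy::Analysis::Token;

# Hypothetical weight for text that appeared in bold in the source HTML.
my $bold_boost = 2.0;

my $token = Lucy::Analysis::Token->new(
    text         => 'blind',
    start_offset => 8,
    end_offset   => 13,
    boost        => $bold_boost,
);

# Read the weight back through the documented accessor.
printf "%s boost=%.1f\n", $token->get_text, $token->get_boost;
```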
pos_inc is the POSition INCrement, measured in Tokens. This attribute, which defaults to 1, is an advanced tool for manipulating phrase matching. Ordinarily, Tokens are assigned consecutive position numbers: 0, 1, and 2 for "three blind mice". However, if you set the position increment for “blind” to, say, 1000, then the three tokens will end up assigned to positions 0, 1, and 1001 – and will no longer produce a phrase match for the query "three blind mice".
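The position arithmetic above can be reproduced with a small pure-Perl model. This is an illustrative sketch that matches the numbers in the text, not a copy of Lucy's internals: assign_positions is a hypothetical helper, and plain hashes stand in for Token objects.

```perl
use strict;
use warnings;

# Hypothetical helper: each token receives the running position, and the
# running position then advances by that token's pos_inc.
sub assign_positions {
    my @tokens = @_;
    my $pos    = 0;
    my @positions;
    for my $token (@tokens) {
        push @positions, $pos;
        $pos += $token->{pos_inc};
    }
    return @positions;
}

# Default increments give consecutive positions.
my @default = assign_positions(
    { text => 'three', pos_inc => 1 },
    { text => 'blind', pos_inc => 1 },
    { text => 'mice',  pos_inc => 1 },
);

# A large increment on "blind" pushes "mice" far away.
my @gapped = assign_positions(
    { text => 'three', pos_inc => 1 },
    { text => 'blind', pos_inc => 1000 },
    { text => 'mice',  pos_inc => 1 },
);

print "default: @default\n";    # default: 0 1 2
print "gapped:  @gapped\n";     # gapped:  0 1 1001
```

In the second case "blind" and "mice" are no longer at adjacent positions, which is why the phrase query stops matching.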
    my $token = Lucy::Analysis::Token->new(
        text         => $text,          # required
        start_offset => $start_offset,  # required
        end_offset   => $end_offset,    # required
        boost        => 1.0,            # optional
        pos_inc      => 1,              # optional
    );
my $text = $token->get_text;
Get the token's text.
$token->set_text($text);
Set the token's text.
my $int = $token->get_start_offset();
my $int = $token->get_end_offset();
my $float = $token->get_boost();
my $int = $token->get_pos_inc();
my $int = $token->get_len();
Lucy::Analysis::Token isa Clownfish::Obj.
Copyright © 2010-2015 The Apache Software Foundation, Licensed under the
Apache License, Version 2.0.