Lucy::Analysis::Token – Apache Lucy Documentation

SYNOPSIS

    my $token = Lucy::Analysis::Token->new(
        text         => 'blind',
        start_offset => 8,
        end_offset   => 13,
    );

    $token->set_text('mice');

DESCRIPTION

Token is the fundamental unit used by Apache Lucy’s Analyzer subclasses. Each Token has five attributes: text, start_offset, end_offset, boost, and pos_inc.

The text attribute is a Unicode string encoded as UTF-8.

start_offset is the start point of the token text, measured in Unicode code points from the beginning of the stored field; end_offset is the corresponding end point, measured the same way. Together, start_offset and end_offset locate the Token within a larger context, even if the Token’s text attribute gets modified – by stemming, for instance. The Token for “beating” in the text “beating a dead horse” begins life with a start_offset of 0 and an end_offset of 7; after stemming, the text is “beat”, but the start_offset is still 0 and the end_offset is still 7. This allows “beating” to be highlighted correctly after a search matches “beat”.
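
The following is a minimal sketch of why unchanged offsets matter for highlighting. It uses only core Perl; highlight_span is a hypothetical helper and the hash is a stand-in for a stemmed Token, neither of which is part of Lucy. It assumes the field text is a decoded Perl string, so that substr() counts code points the same way the offsets do.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lucy): wrap the span between
# start_offset and end_offset of the original field text in <b> markers.
sub highlight_span {
    my ( $field_text, $start_offset, $end_offset ) = @_;
    my $length = $end_offset - $start_offset;
    return substr( $field_text, 0, $start_offset )
        . '<b>'
        . substr( $field_text, $start_offset, $length )
        . '</b>'
        . substr( $field_text, $end_offset );
}

my $field_text = 'beating a dead horse';

# Plain hash standing in for a Token after stemming:
# the text has changed to "beat", but the offsets have not.
my %stemmed = ( text => 'beat', start_offset => 0, end_offset => 7 );

# Prints: <b>beating</b> a dead horse
print highlight_span( $field_text, $stemmed{start_offset}, $stemmed{end_offset} ), "\n";
```

Because the span is cut from the original field text rather than from the token's text, the full word “beating” is marked even though the stored token text is “beat”.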

boost is a per-token weight. Use this when you want to assign more or less importance to a particular token, as you might for emboldened text within an HTML document, for example. (Note: The field this token belongs to must be spec’d to use a posting of type RichPosting.)

pos_inc is the POSition INCrement, measured in Tokens. This attribute, which defaults to 1, is an advanced tool for manipulating phrase matching. Ordinarily, Tokens are assigned consecutive position numbers: 0, 1, and 2 for "three blind mice". However, if you set the position increment for “blind” to, say, 1000, then the three tokens will end up assigned to positions 0, 1, and 1001 – and will no longer produce a phrase match for the query "three blind mice".
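
The arithmetic can be sketched in core Perl. This is an illustrative model only, chosen because it reproduces the positions quoted above; it is not Lucy's internal code. It assumes each token takes the running position and that its pos_inc sets the gap to the token that follows. assign_positions and is_consecutive are hypothetical helper names.

```perl
use strict;
use warnings;

# Illustrative model: each token receives the running position, then the
# running position advances by that token's pos_inc.
sub assign_positions {
    my @pos_incs = @_;
    my @positions;
    my $position = 0;
    for my $pos_inc (@pos_incs) {
        push @positions, $position;
        $position += $pos_inc;
    }
    return @positions;
}

# A phrase can only match when its tokens sit at consecutive positions.
sub is_consecutive {
    my @positions = @_;
    for my $i ( 1 .. $#positions ) {
        return 0 if $positions[$i] != $positions[ $i - 1 ] + 1;
    }
    return 1;
}

# "three blind mice" with default increments: 0, 1, 2
my @default = assign_positions( 1, 1, 1 );
print join( ', ', @default ), "\n";

# Same tokens with pos_inc for "blind" set to 1000: 0, 1, 1001
my @gapped = assign_positions( 1, 1000, 1 );
print join( ', ', @gapped ), "\n";

print is_consecutive(@default) ? "phrase can match\n" : "phrase cannot match\n";
print is_consecutive(@gapped)  ? "phrase can match\n" : "phrase cannot match\n";
```

Under this model, “three” and “blind” stay adjacent, while “mice” lands far enough away that the three tokens no longer form a contiguous phrase.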

CONSTRUCTORS

new

my $token = Lucy::Analysis::Token->new(
    text         => $text,          # required
    start_offset => $start_offset,  # required
    end_offset   => $end_offset,    # required
    boost        => 1.0,            # optional
    pos_inc      => 1,              # optional
);

text - A string.
start_offset - Start offset into the original document in Unicode code points.
end_offset - End offset into the original document in Unicode code points.
boost - Per-token weight.
pos_inc - Position increment for phrase matching.

METHODS

get_text

my $text = $token->get_text;

Get the token's text.

set_text

$token->set_text($text);

Set the token's text.

get_start_offset

my $int = $token->get_start_offset();

get_end_offset

my $int = $token->get_end_offset();

get_boost

my $float = $token->get_boost();

get_pos_inc

my $int = $token->get_pos_inc();

get_len

my $int = $token->get_len();

INHERITANCE

Lucy::Analysis::Token isa Clownfish::Obj.

Copyright © 2010-2015 The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
Apache Lucy, Lucy, Apache, the Apache feather logo, and the Apache Lucy project logo are trademarks of The Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their respective owners.