parcel | Lucy |
class variable | LUCY_TOKEN |
struct symbol | lucy_Token |
class nickname | lucy_Token |
header file | Lucy/Analysis/Token.h |
Lucy::Analysis::Token – Unit of text.
Token is the fundamental unit used by Apache Lucy’s Analyzer subclasses.
Each Token has five attributes: text, start_offset, end_offset, boost, and
pos_inc.
The text attribute is a Unicode string encoded as UTF-8.
start_offset is the start point of the token text, measured in Unicode code
points from the top of the stored field; end_offset delimits the
corresponding closing boundary.
start_offset and end_offset locate the Token within a larger context, even
if the Token’s text attribute gets modified, for instance by stemming. The
Token for “beating” in the text “beating a dead horse” begins life with a
start_offset of 0 and an end_offset of 7; after stemming, the text is
“beat”, but the start_offset is still 0 and the end_offset is still 7. This
allows “beating” to be highlighted correctly after a search matches “beat”.
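The offset behaviour described above can be sketched in self-contained C. The `SketchToken` struct and the helpers `cp_to_byte`, `highlight`, and `stem_to` are hypothetical stand-ins written for this illustration, not part of Lucy's API; the sketch assumes only that offsets count Unicode code points in valid UTF-8 and that end_offset is exclusive.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a Token: only text and offsets are modelled. */
typedef struct {
    char     text[32];
    uint32_t start_offset;  /* in code points, from the top of the field */
    uint32_t end_offset;    /* in code points, exclusive (assumption) */
} SketchToken;

/* Convert a code point offset into a byte offset within valid UTF-8.
 * Continuation bytes have the bit pattern 10xxxxxx and never start a
 * new code point, so they are skipped along with their lead byte. */
static size_t
cp_to_byte(const char *utf8, uint32_t cp_offset) {
    size_t byte = 0;
    while (utf8[byte] != '\0' && cp_offset > 0) {
        byte++;
        while (((unsigned char)utf8[byte] & 0xC0) == 0x80) { byte++; }
        cp_offset--;
    }
    return byte;
}

/* Copy the field into out, wrapping the span the token covers in [ ]. */
static void
highlight(const char *field, const SketchToken *token,
          char *out, size_t out_size) {
    size_t start = cp_to_byte(field, token->start_offset);
    size_t end   = cp_to_byte(field, token->end_offset);
    snprintf(out, out_size, "%.*s[%.*s]%s",
             (int)start, field,
             (int)(end - start), field + start,
             field + end);
}

/* Crude stand-in for a stemmer: replaces the text, leaves offsets alone. */
static void
stem_to(SketchToken *token, const char *stem) {
    snprintf(token->text, sizeof(token->text), "%s", stem);
}
```

Calling `stem_to(&token, "beat")` on a token created as `{ "beating", 0, 7 }` changes only the text, so `highlight` still brackets the original word “beating” in the field. Because the offsets are in code points, `cp_to_byte` is needed before slicing any field that contains multi-byte characters.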
boost is a per-token weight. Use it to assign more or less importance to a
particular token, as you might for emboldened text within an HTML document.
(Note: The field this token belongs to must be spec’d to use a posting of
type RichPosting.)
pos_inc is the POSition INCrement, measured in Tokens. This attribute, which
defaults to 1, is an advanced tool for manipulating phrase matching.
Ordinarily, Tokens are assigned consecutive position numbers: 0, 1, and 2
for "three blind mice". However, if you set the position increment for
“blind” to, say, 1000, then the three tokens will end up assigned to
positions 0, 1, and 1001, and will no longer produce a phrase match for the
query "three blind mice".
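The position arithmetic can be illustrated with a minimal sketch. The functions `assign_positions` and `is_consecutive` are hypothetical and are not Lucy's implementation. The sketch assumes that a token's increment is added after that token has been placed, so that it sets the gap to the following token; this assumption is chosen because it reproduces the positions 0, 1, and 1001 given above.

```c
#include <stddef.h>
#include <stdint.h>

/* Assign a position to each token from its position increment.
 * Assumption: a token's increment is added after that token is placed,
 * so it determines the gap to the token that follows. */
static void
assign_positions(const int32_t *pos_incs, size_t count, int32_t *positions) {
    int32_t pos = 0;
    for (size_t i = 0; i < count; i++) {
        positions[i] = pos;
        pos += pos_incs[i];
    }
}

/* Return 1 if the tokens sit at consecutive positions, as an exact
 * phrase match would require in this simplified model; 0 otherwise. */
static int
is_consecutive(const int32_t *positions, size_t count) {
    for (size_t i = 1; i < count; i++) {
        if (positions[i] != positions[i - 1] + 1) { return 0; }
    }
    return 1;
}
```

With increments of 1, 1, 1 the three tokens land on 0, 1, 2 and `is_consecutive` returns 1. With increments of 1, 1000, 1 they land on 0, 1, 1001, and the check fails, which mirrors the lost phrase match.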
lucy_Token* // incremented
lucy_Token_new(
    char *text,
    size_t len,
    uint32_t start_offset,
    uint32_t end_offset,
    float boost,
    int32_t pos_inc
);
Create a new Token.
text | A UTF-8 string. |
len | Size of the string in bytes. |
start_offset | Start offset into the original document in Unicode code points. |
end_offset | End offset into the original document in Unicode code points. |
boost | Per-token weight. |
pos_inc | Position increment for phrase matching. |
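Since the string length is given in bytes while the offsets are given in Unicode code points, the two values differ for non-ASCII text. The following is a minimal sketch of how both could be derived for a token; `count_code_points`, `TokenArgs`, and `token_args` are hypothetical helpers, not part of Lucy. It assumes valid, NUL-terminated UTF-8 and a token text identical to the span it covers in the original document.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count code points in valid, NUL-terminated UTF-8 by counting every
 * byte that is not a continuation byte (10xxxxxx). */
static uint32_t
count_code_points(const char *utf8) {
    uint32_t count = 0;
    for (const unsigned char *p = (const unsigned char *)utf8; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80) { count++; }
    }
    return count;
}

/* Hypothetical bundle of the length and offset arguments for one token. */
typedef struct {
    size_t   len;           /* size of the text in bytes */
    uint32_t start_offset;  /* in code points */
    uint32_t end_offset;    /* in code points */
} TokenArgs;

/* Derive the arguments for a token whose unmodified text begins at
 * start_offset (in code points) within the original document. */
static TokenArgs
token_args(const char *text, uint32_t start_offset) {
    TokenArgs args;
    args.len          = strlen(text);
    args.start_offset = start_offset;
    args.end_offset   = start_offset + count_code_points(text);
    return args;
}
```

For the text “naïve” starting at code point 10, this yields a length of 6 bytes but offsets of 10 and 15, because “ï” occupies two bytes and one code point.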
lucy_Token*
lucy_Token_init(
    lucy_Token *self,
    char *text,
    size_t len,
    uint32_t start_offset,
    uint32_t end_offset,
    float boost,
    int32_t pos_inc
);
Initialize a Token.
text | A UTF-8 string. |
len | Size of the string in bytes. |
start_offset | Start offset into the original document in Unicode code points. |
end_offset | End offset into the original document in Unicode code points. |
boost | Per-token weight. |
pos_inc | Position increment for phrase matching. |
uint32_t
lucy_Token_Get_Start_Offset(
    lucy_Token *self
);
uint32_t
lucy_Token_Get_End_Offset(
    lucy_Token *self
);
float
lucy_Token_Get_Boost(
    lucy_Token *self
);
int32_t
lucy_Token_Get_Pos_Inc(
    lucy_Token *self
);
char*
lucy_Token_Get_Text(
    lucy_Token *self
);
size_t
lucy_Token_Get_Len(
    lucy_Token *self
);
void
lucy_Token_Set_Text(
    lucy_Token *self,
    char *text,
    size_t len
);
Lucy::Analysis::Token is a Clownfish::Obj.
Copyright © 2010-2015 The Apache Software Foundation, Licensed under the
Apache License, Version 2.0.