PIRL

PIRL.Strings
Class Words

java.lang.Object
  extended by PIRL.Strings.Words

public class Words
extends Object

The Words class provides a mechanism to treat a String as a sequence of delimited words.

Version:
1.24
Author:
Bradford Castalia, UA/PIRL
See Also:
String_Buffer

Nested Class Summary
 class Words.Word_Index
          A Word_Index provides a start,end string location for a word.
 
Field Summary
static String DEFAULT_DELIMITERS
          The default word delimiters.
static String DEFAULT_MASK
          The default word mask.
static boolean DELIMIT_AT_QUOTE
          Whether or not to delimit words at quotes.
 int End_Index
          The index (exclusive) in the current string where the current word ends.
static String ID
           
 Words.Word_Index Mark_Index
          The Word_Index of the last marked word.
static boolean PARENTHESIZED_WORDS
          Whether or not to treat parenthesized strings as a word.
static boolean QUOTED_WORDS
          Whether or not to treat quoted strings as a word.
 int Start_Index
          The index in the current string where the next word starts.
 
Constructor Summary
Words()
          Constructs Words with no characters.
Words(String characters)
          Constructs Words from a String of characters.
 
Method Summary
 Words Characters(String characters)
          Sets the String of characters.
 boolean Delimit_at_Quote()
          Test if quotes will delimit words.
 Words Delimit_at_Quote(boolean enable)
          Enable or disable delimiting words at quotes.
 String Delimiters()
          Gets the current delimiters.
 Words Delimiters(String delimiters)
          Sets the word delimiter characters.
 Words Location(int location)
          Moves the word indices to a new location.
 Words Mark()
          Marks the current word location.
 String Mask()
          Gets the word mask.
 Words Mask(String mask)
          Sets the mask to use when words are masked.
 Words Mask(Vector<String> names)
          Words preceded by any one of a set of names are masked.
 Words.Word_Index Next_Location()
          Moves the word indices to the location of the next word.
 String Next_Word()
          Gets the next word.
 boolean Parenthesized_Words()
          Test if parenthesized strings are treated as single words.
 Words Parenthesized_Words(boolean enable)
          Enable or disable the treatment of parenthesized strings as words.
 boolean Quoted_Words()
          Test if quoted strings are treated as single words.
 Words Quoted_Words(boolean enable)
          Enable or disable the treatment of quoted strings as words.
 Words Restore()
          Restores the current word to the last marked location.
 Vector<String> Split()
          Splits the remaining characters into words.
 Vector<String> Split(int limit)
          Splits the remaining characters into words.
 String Substring(int start)
          Gets a substring of the characters.
 String Substring(int start, int end)
          Gets a substring of the characters.
 String Substring(Words.Word_Index word_index)
          Gets a substring of the characters.
 String toString()
          Gets the Words characters.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

ID

public static final String ID
See Also:
Constant Field Values

Start_Index

public int Start_Index
The index in the current string where the next word starts.

Initially, this is the beginning (0) of the words string. This will be -1 when there are no more words.


End_Index

public int End_Index
The index (exclusive) in the current string where the current word ends.

This will be zero if no Next_Word has been selected. This will be -1 when there are no more words.


Mark_Index

public Words.Word_Index Mark_Index
The Word_Index of the last marked word.


DEFAULT_DELIMITERS

public static final String DEFAULT_DELIMITERS
The default word delimiters.

The usual whitespace characters: " \n\r\t".

See Also:
Constant Field Values

DEFAULT_MASK

public static final String DEFAULT_MASK
The default word mask.

See Also:
Constant Field Values

QUOTED_WORDS

public static boolean QUOTED_WORDS
Whether or not to treat quoted strings as a word.


DELIMIT_AT_QUOTE

public static boolean DELIMIT_AT_QUOTE
Whether or not to delimit words at quotes.


PARENTHESIZED_WORDS

public static boolean PARENTHESIZED_WORDS
Whether or not to treat parenthesized strings as a word.

Constructor Detail

Words

public Words(String characters)
Constructs Words from a String of characters.

Parameters:
characters - The String of characters containing words.

Words

public Words()
Constructs Words with no characters.

See Also:
Characters(String)
Method Detail

toString

public String toString()
Gets the Words characters.

Overrides:
toString in class Object
Returns:
The current String of characters.

Characters

public Words Characters(String characters)
Sets the String of characters.

The current location is reset to the beginning of the string. The Mark_Index is reset to (0,0).

Parameters:
characters - The String of characters.
Returns:
This Words object.
See Also:
Location(int)

Delimiters

public Words Delimiters(String delimiters)
Sets the word delimiter characters.

A word is delimited by a contiguous sequence of characters that are all members of the delimiters set. A sequence of more than one delimiter character, whether the same or different, does not result in empty words; i.e. any contiguous sequence of one or more delimiters is treated as a single word delimiter.

N.B.: Any character that starts a special sequence (a quote or an opening parenthesis) should not be included as one of the delimiter characters. If it is, special sequence recognition will be effectively disabled.
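
The delimiter rule above can be illustrated with a minimal stand-alone sketch. It does not use or reproduce the PIRL source; the class name DelimiterRunSketch and the method words are hypothetical, and special sequences (quotes, parentheses, escapes) are deliberately ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not the PIRL.Strings.Words implementation.
final class DelimiterRunSketch {
    /** Collects words separated by runs of one or more delimiter characters. */
    static List<String> words(String characters, String delimiters) {
        List<String> found = new ArrayList<>();
        int index = 0;
        while (index < characters.length()) {
            // Skip the whole run of delimiters; a run never yields an empty word.
            while (index < characters.length()
                    && delimiters.indexOf(characters.charAt(index)) >= 0) {
                index++;
            }
            int start = index;
            // Advance to the next delimiter or to the end of the string.
            while (index < characters.length()
                    && delimiters.indexOf(characters.charAt(index)) < 0) {
                index++;
            }
            if (index > start) {
                found.add(characters.substring(start, index));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Mixed runs of spaces, tabs and newlines collapse into single separators.
        System.out.println(words("alpha  \t beta\n\ngamma ", " \n\r\t"));
        // prints [alpha, beta, gamma]
    }
}
```

Leading, trailing and repeated delimiters all behave the same way in this sketch: they separate words but never produce an empty one.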

Parameters:
delimiters - The String of delimiter characters. If null, the DEFAULT_DELIMITERS will be used.
Returns:
This Words object.
See Also:
Quoted_Words(boolean), Parenthesized_Words(boolean)

Delimiters

public String Delimiters()
Gets the current delimiters.

Returns:
The String of delimiter characters.

Quoted_Words

public Words Quoted_Words(boolean enable)
Enable or disable the treatment of quoted strings as words.

The enclosing quote characters are included in the word. If there is no matching unescaped quote character before the end of the string, the resulting word will not have the matching closing quote character at End_Index - 1.

Parameters:
enable - true if all characters within unescaped quotes (' or ") are to be treated as a single word; false otherwise.
Returns:
This Words object.
See Also:
Delimit_at_Quote(boolean), Next_Location()

Quoted_Words

public boolean Quoted_Words()
Test if quoted strings are treated as single words.

Returns:
true if all characters within unescaped quotes (' or ") will be treated as a single word; false otherwise.
See Also:
Quoted_Words(boolean)

Delimit_at_Quote

public Words Delimit_at_Quote(boolean enable)
Enable or disable delimiting words at quotes.

When unescaped quote characters are encountered they may delimit a word even if no delimiter character precedes or follows the quote. Disabling quote delimiting causes contiguous non-delimiter characters to be included as part of the quoted string word. The quotes remain in the word in either case.
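
As a rough illustration of the disabled case, the following stand-alone sketch finds the end of a word when quotes do not delimit: characters adjacent to a quoted string stay in the same word, and delimiters inside the quotes are ignored. The class QuoteJoinSketch and its method endOfWord are hypothetical names, and the logic is an assumption drawn from the description above rather than the PIRL implementation.

```java
// Hypothetical stand-alone sketch; assumes quoted words are enabled and
// delimit-at-quote is disabled. Not the PIRL.Strings.Words implementation.
final class QuoteJoinSketch {
    /** Returns the exclusive end index of the word that begins at start. */
    static int endOfWord(String characters, int start, String delimiters) {
        char open = 0; // the quote character currently open, or 0 if none
        int index = start;
        while (index < characters.length()) {
            char c = characters.charAt(index);
            if (c == '\\') {
                index += 2; // an escaped character gets no special treatment
                continue;
            }
            if (open != 0) {
                if (c == open) {
                    open = 0; // matching quote closes the quoted part
                }
            } else if (c == '"' || c == '\'') {
                open = c; // a quote opens a quoted part inside the same word
            } else if (delimiters.indexOf(c) >= 0) {
                return index; // an unquoted delimiter ends the word
            }
            index++;
        }
        return characters.length();
    }

    public static void main(String[] args) {
        String text = "key=\"a b\"x next";
        System.out.println(text.substring(0, endOfWord(text, 0, " \n\r\t")));
    }
}
```

With delimit-at-quote enabled, the same input would instead be broken at each quote, leaving the quoted string as a word of its own.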

Parameters:
enable - true if unescaped quote characters delimit a word; false otherwise.
Returns:
This Words object.
See Also:
Quoted_Words(boolean)

Delimit_at_Quote

public boolean Delimit_at_Quote()
Test if quotes will delimit words.

Returns:
true if unescaped quote characters delimit a word; false otherwise.
See Also:
Delimit_at_Quote(boolean)

Parenthesized_Words

public Words Parenthesized_Words(boolean enable)
Enable or disable the treatment of parenthesized strings as words.

N.B.: Nested parenthesized strings are included in a parenthesized string.

The enclosing parentheses characters are included in the word. If there is no matching unescaped closing parenthesis character (ignoring nested parentheses) before the end of the string, the resulting word will not have one at End_Index - 1.
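
The nesting rule can be sketched as a simple level counter. This is a stand-alone illustration under stated assumptions, not the PIRL source: the class ParenthesisScanSketch and the method endOfParenthesized are hypothetical, and the scan is assumed to begin at the opening parenthesis.

```java
// Hypothetical stand-alone sketch; not the PIRL.Strings.Words implementation.
final class ParenthesisScanSketch {
    /**
     * Returns the exclusive end index of the parenthesized sequence that
     * opens at start, or the string length if it is never closed.
     */
    static int endOfParenthesized(String characters, int start) {
        int level = 0;
        for (int index = start; index < characters.length(); index++) {
            char c = characters.charAt(index);
            if (c == '\\') {
                index++; // skip the escaped character; it cannot open or close
            } else if (c == '(') {
                level++; // the opening parenthesis and any nested ones
            } else if (c == ')' && --level == 0) {
                return index + 1; // the closing parenthesis is part of the word
            }
        }
        return characters.length(); // unbalanced: no closing parenthesis at end - 1
    }

    public static void main(String[] args) {
        String text = "(a (b) c) d";
        System.out.println(text.substring(0, endOfParenthesized(text, 0)));
        // prints (a (b) c)
    }
}
```

For an unbalanced input such as "(a (b" the sketch returns the string length, so the character at the end index minus one is not a closing parenthesis.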

Parameters:
enable - true if all characters within unescaped parenthesized ('(' and ')') strings are to be treated as a single word; false otherwise.
Returns:
This Words object.
See Also:
Next_Location()

Parenthesized_Words

public boolean Parenthesized_Words()
Test if parenthesized strings are treated as single words.

Returns:
true if all characters within unescaped parenthesized strings will be treated as a single word; false otherwise.
See Also:
Parenthesized_Words(boolean)

Substring

public String Substring(int start,
                        int end)
Gets a substring of the characters.

Parameters:
start - The start index of the substring.
end - The end index of the substring.
Returns:
The substring from the start index up to, but not including, the end index.
See Also:
StringBuffer.substring(int, int)

Substring

public String Substring(int start)
Gets a substring of the characters.

Parameters:
start - The start index of the substring.
Returns:
The substring from the start index to the end of the characters string.
See Also:
StringBuffer.substring(int)

Substring

public String Substring(Words.Word_Index word_index)
Gets a substring of the characters.

Parameters:
word_index - A Word_Index for the substring.
Returns:
The substring from the word_index.Start_Index up to, but not including, the word_index.End_Index.
See Also:
StringBuffer.substring(int, int)

Location

public Words Location(int location)
               throws IndexOutOfBoundsException
Moves the word indices to a new location.

The location must be within the words string.

Parameters:
location - An index in the words string.
Returns:
This Words object.
Throws:
IndexOutOfBoundsException - If the location is not within the words string.

Mark

public Words Mark()
Marks the current word location.

The current Start_Index and End_Index are stored in the Mark_Index.

Returns:
This Words object.
See Also:
Restore()

Restore

public Words Restore()
Restores the current word to the last marked location.

Returns:
This Words object.
Throws:
StringIndexOutOfBoundsException - If the Word_Index.Start_Index is less than zero or the Word_Index.End_Index is greater than the number of characters available.
See Also:
Mark()

Next_Location

public Words.Word_Index Next_Location()
Moves the word indices to the location of the next word.

Beginning at the current End_Index all delimiter characters are skipped to find the new Start_Index. If the end of the characters string is reached without finding a non-delimiter character then there are no more words available. In this case both the Start_Index and End_Index will be equal to the character string length and nothing more will be done.

N.B.: Any character that starts a special sequence (a quote or an opening parenthesis) should not be included as one of the delimiter characters. If it is, special sequence recognition will be effectively disabled.

The character at the Start_Index is checked to see if it starts a special sequence. If quoted words is enabled either a single (') or double (") quote character will be recognized and set as the end of sequence marker character. If parenthesized words is enabled an opening parenthesis ('(') character will be recognized and the end of sequence marker character will be set to the closing parenthesis (')') character. A special sequence start character is included as part of the word.

When delimit at quote is enabled in addition to quoted words being enabled, quoted strings are delimited as separate words even if a contiguous non-delimiter character precedes and/or follows the enclosing quotes. When delimit at quote is disabled, the contiguous non-delimiter characters are treated as part of the word that includes the quoted string.

The word contains all characters up to and including an unescaped end of sequence marker character. For a parenthesized sequence the marker character must be at parenthesis level zero to end the sequence; unescaped nested parentheses increase the parenthesis level. Note that a special sequence may include what would otherwise be considered delimiter characters, and the enclosing characters - quotes or parentheses - are included as part of the word. If the end of the characters string is reached before the expected marker character is found the resulting word will be "unbalanced"; the character at End_Index - 1 will not be the marker character.

If no end of sequence marker character has been set, then the word will end when any unescaped delimiter character is found or the end of the characters string is reached. If quoted words are enabled a quote character will be recognized as a delimiter character. If parenthesized words are enabled an opening parenthesis character will be recognized as a delimiter character. The index of the delimiter character becomes the new End_Index; it is not included as part of the word.

Any character preceded by a backslash ('\') character is escaped from any special treatment. All escaped characters are taken to be part of the word, the backslash character included.
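
The scan described above can be approximated by the following stand-alone sketch. It is not the PIRL source. It assumes quoted words and delimit at quote are both enabled and parenthesized words are disabled; the class NextLocationSketch, the method nextLocation, and the int pair used in place of a Word_Index are all hypothetical.

```java
// Hypothetical stand-alone sketch of the scan; not the PIRL implementation.
// Assumes quoted words and delimit-at-quote enabled, parenthesized words disabled.
final class NextLocationSketch {
    static final String DELIMITERS = " \n\r\t";

    /** Returns {start, end} (end exclusive) of the next word at or after from. */
    static int[] nextLocation(String characters, int from) {
        int length = characters.length();

        // Skip every delimiter to find the start of the word.
        int start = from;
        while (start < length && DELIMITERS.indexOf(characters.charAt(start)) >= 0) {
            start++;
        }
        if (start == length) {
            return new int[] {length, length}; // no more words
        }

        // A quote at the start sets the end of sequence marker.
        char first = characters.charAt(start);
        char marker = (first == '"' || first == '\'') ? first : 0;

        int index = (marker != 0) ? start + 1 : start;
        while (index < length) {
            char c = characters.charAt(index);
            if (c == '\\') {
                index += 2; // the backslash and the escaped character stay in the word
                continue;
            }
            if (marker != 0) {
                if (c == marker) {
                    return new int[] {start, index + 1}; // the marker is part of the word
                }
            } else if (DELIMITERS.indexOf(c) >= 0 || c == '"' || c == '\'') {
                return new int[] {start, index}; // the delimiter is not part of the word
            }
            index++;
        }
        return new int[] {start, length}; // end of string; a quoted word is unbalanced
    }

    public static void main(String[] args) {
        String text = "say \"hello world\" a\\ b";
        int[] location = nextLocation(text, 0);
        while (location[0] < text.length()) {
            System.out.println(text.substring(location[0], location[1]));
            location = nextLocation(text, location[1]);
        }
    }
}
```

Run on the sample text, the loop yields three words: say, the quoted string with its quotes, and a\ b with the backslash retained. After the last word both indices equal the string length, which ends the loop.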

Returns:
A Word_Index for the next word. If there are no more words the word index will be set to the end of the characters.

Next_Word

public String Next_Word()
Gets the next word.

The Start_Index will be moved forward from the current End_Index over any Delimiters. Then the End_Index will be moved forward from the Start_Index until any Delimiters are found or the end of the string is reached.

Returns:
The substring from the next Start_Index up to, but not including, the next End_Index. If there are no more words, the empty String will be returned.
See Also:
Next_Location()

Split

public Vector<String> Split(int limit)
Splits the remaining characters into words.

Beginning with the next word, words are collected into a Vector in the order they occur in the string.

If the limit is 0 all available words will be returned; no delimiters will be included in any word that is returned. If the limit is positive (> 0) no more than limit words will be returned; the last "word" will contain all characters, including any delimiters, following the start of the last word (delimiters preceding the last word will not be included). A negative limit acts the same as a positive limit except the last "word" will contain all characters following the end of the previous word (delimiters preceding the last word will be included). Note that a limit of -1 will return all characters from the current End_Index to the end of the characters string.

Fewer than limit words may be returned. No empty words will be returned.
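
The three limit cases can be compared side by side with a stand-alone sketch. It is not the PIRL source: the class SplitLimitSketch and its methods are hypothetical, only whitespace delimiters are handled, and the choice to return nothing further when only delimiters remain is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch of the limit rules; not the PIRL implementation.
final class SplitLimitSketch {
    static boolean isDelimiter(char c) {
        return " \n\r\t".indexOf(c) >= 0;
    }

    static List<String> split(String characters, int limit) {
        List<String> words = new ArrayList<>();
        int max = Math.abs(limit);
        int end = 0;
        while (true) {
            int previousEnd = end;
            int start = previousEnd;
            while (start < characters.length() && isDelimiter(characters.charAt(start))) {
                start++;
            }
            if (start == characters.length()) {
                break; // only delimiters remain (assumed: nothing more is returned)
            }
            if (limit != 0 && words.size() == max - 1) {
                // Last "word": positive limits begin at the word start,
                // negative limits begin at the end of the previous word.
                words.add(characters.substring(limit > 0 ? start : previousEnd));
                break;
            }
            end = start;
            while (end < characters.length() && !isDelimiter(characters.charAt(end))) {
                end++;
            }
            words.add(characters.substring(start, end));
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "one two  three four";
        System.out.println(split(text, 0));  // every word, no delimiters
        System.out.println(split(text, 2));  // last word starts at "two"
        System.out.println(split(text, -2)); // last word starts right after "one"
    }
}
```

The only difference between the positive and negative results is the leading delimiter kept in the final element of the negative case.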

Parameters:
limit - The word limit to return.
Returns:
A Vector of zero or more words.
See Also:
Next_Location()

Split

public Vector<String> Split()
Splits the remaining characters into words.

Beginning with the next word, words are collected into a Vector in the order they occur in the string.

Returns:
A Vector of zero or more words.
See Also:
Split(int)

Mask

public Words Mask(String mask)
Sets the mask to use when words are masked.

Parameters:
mask - The mask String. This may be null.
Returns:
This Words object.
See Also:
Mask(Vector)

Mask

public String Mask()
Gets the word mask.

Returns:
The String to be used when masking words.
See Also:
Mask(Vector)

Mask

public Words Mask(Vector<String> names)
Words preceded by any one of a set of names are masked.

The words are searched for matches with the names. When a match is found, the following word is replaced with the mask String. If the mask String is null, the preceding name as well as its word is deleted.

N.B.: The mask string may be one of the names. The mask substitution is never compared against the names list.
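
A simplified, stand-alone sketch of the masking rule follows. It works on an already split list of words rather than on the character string, so it is an illustration of the rule and not the PIRL implementation. The class MaskSketch, the method mask, the sample names, and the mask value "****" are hypothetical. Leaving a trailing name with no following word unchanged is also an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch of the masking rule; not the PIRL implementation.
final class MaskSketch {
    static List<String> mask(List<String> words, List<String> names, String mask) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i);
            if (names.contains(word) && i + 1 < words.size()) {
                if (mask != null) {
                    result.add(word); // keep the name
                    result.add(mask); // replace the word that follows it
                }
                // With a null mask both the name and its word are dropped.
                i++; // the substitution itself is never compared against the names
            } else {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("-user", "bob", "-password", "secret", "run");
        List<String> names = Arrays.asList("-password");
        System.out.println(mask(words, names, "****"));
        System.out.println(mask(words, names, null));
    }
}
```

Because the index skips past the replaced word, a mask string that happens to equal one of the names does not trigger a second substitution.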

Parameters:
names - A Vector of names to find.
Returns:
This Words object.
See Also:
Mask(String), Next_Word()


Copyright (C) 2003-2009 Bradford Castalia, University of Arizona