Text Processors¶
This module contains classes and functions for text processing. It is imported as follows:
import pynlpl.textprocessors
Tokenisation¶
A very crude tokeniser is available in the form of the function pynlpl.textprocessors.crude_tokeniser(string). This splits punctuation characters from words and returns a list of tokens. It has no regard for abbreviations or end-of-sentence detection, however, which is functionality a more sophisticated tokeniser can provide:
tokens = pynlpl.textprocessors.crude_tokeniser("to be, or not to be.")
This will result in:
tokens == ['to', 'be', ',', 'or', 'not', 'to', 'be', '.']
N-gram extraction¶
The extraction of n-grams is a fundamental operation in Natural Language Processing. PyNLPl offers the Windower class to accomplish this task:
from pynlpl.textprocessors import Windower

tokens = pynlpl.textprocessors.crude_tokeniser("to be or not to be")
for trigram in Windower(tokens, 3):
    print(trigram)
The input to the Windower should be a list of words and a value for n. In addition, the Windower can output extra marker symbols at the beginning and end of the input sequence. By default this behaviour is enabled; the begin marker is <begin> and the end marker is <end>. If this behaviour is unwanted, you can suppress it by instantiating the Windower as follows:
Windower(tokens, 3, None, None)
The Windower is implemented as a Python generator and at each iteration yields a tuple of length n.
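For example, with n=2 and the default markers, iterating yields tuples along the following lines (the sample tokens are an assumption and the output in the comments is what the documented semantics predict, not verified output):
tokens = ['to', 'be', 'or', 'not', 'to', 'be']
for bigram in Windower(tokens, 2):
    print(bigram)
# ('<begin>', 'to')
# ('to', 'be')
# ('be', 'or')
# ('or', 'not')
# ('not', 'to')
# ('to', 'be')
# ('be', '<end>')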
- class pynlpl.textprocessors.MultiWindower(tokens, min_n=1, max_n=9, beginmarker=None, endmarker=None)¶
  Extract n-grams of various configurations from a sequence.
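A minimal usage sketch (the sample tokens are an assumption; the n-grams listed in the comment are illustrative, and the order in which sizes are yielded is not guaranteed here):
from pynlpl.textprocessors import MultiWindower

# extract all unigrams and bigrams in one pass
for ngram in MultiWindower(['to', 'be', 'or', 'not'], min_n=1, max_n=2):
    print(ngram)
# yields tuples such as ('to',), ('be',), ('to', 'be'), ('be', 'or'), ...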
- class pynlpl.textprocessors.ReflowText(stream, filternontext=True)¶
  Attempts to re-flow a text that has arbitrary line endings in it. Also undoes hyphenisation.
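A minimal sketch feeding it an in-memory stream (the sample text is an assumption; the exact reflowed output is not verified here):
import io
from pynlpl.textprocessors import ReflowText

# hard-wrapped text with a hyphenated line break
stream = io.StringIO("This is a para-\ngraph with arbi-\ntrary line endings.\n")
for line in ReflowText(stream):
    print(line)  # reflowed lines, with the hyphenated words rejoined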
- class pynlpl.textprocessors.Tokenizer(stream, splitsentences=True, onesentenceperline=False, regexps=(re.compile('^(?:(?:https?):(?:(?://)|(?:\\\\))|www\.)(?:[\w\d:#@%/;$()~_?\+-=\\\.&](?:#!)?)*'), re.compile('^[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+(?:\.[a-zA-Z]+)+')))¶
  A tokenizer and sentence splitter which acts on a file/stream-like object. When iterating over the object, it yields lists of tokens (if the sentence splitter is active, the default) or single tokens (if the sentence splitter is deactivated).
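A minimal usage sketch (the sample stream is an assumption; the token lists shown in the comment are illustrative):
import io
from pynlpl.textprocessors import Tokenizer

stream = io.StringIO("This is the first sentence. And this is the second.")
for sentence in Tokenizer(stream):
    print(sentence)  # one list of tokens per sentence, e.g. ['This', 'is', ...]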
- class pynlpl.textprocessors.Windower(tokens, n=1, beginmarker='<begin>', endmarker='<end>')¶
  Moves a sliding window over a list of tokens; upon iteration it yields all n-grams of the specified size as tuples.
Example without markers:
>>> for ngram in Windower("This is a test .", 3, None, None):
...     print(" ".join(ngram))
This is a
is a test
a test .
Example with default markers:
>>> for ngram in Windower("This is a test .", 3):
...     print(" ".join(ngram))
<begin> <begin> This
<begin> This is
This is a
is a test
a test .
test . <end>
. <end> <end>
- pynlpl.textprocessors.calculate_overlap(haystack, needle, allowpartial=True)¶
  Calculate the overlap between two sequences. Yields (overlap, placement) tuples (multiple, because there may be multiple overlaps!). The former is the part of the sequence that overlaps; the latter is -1 if the overlap is on the left side, 0 if it is a subset, 1 if it overlaps on the right side, and 2 if it is an identical match.
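A minimal sketch (the sample sequences are an assumption; the result in the comment follows the placement semantics above but is not verified output):
from pynlpl.textprocessors import calculate_overlap

haystack = ['the', 'cat', 'sat', 'on', 'the', 'mat']
needle = ['on', 'the', 'mat', 'and', 'slept']
for overlap, placement in calculate_overlap(haystack, needle):
    print(overlap, placement)
# expected: an overlap of ('on', 'the', 'mat') with placement 1 (right side)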
- pynlpl.textprocessors.crude_tokenizer(text)¶
  Deprecated alias for tokenize().
- pynlpl.textprocessors.find_keyword_in_context(tokens, keyword, contextsize=1)¶
  Find a keyword in a particular sequence of tokens, and return the local context. Contextsize is the number of words to the left and right. The keyword may consist of multiple words, in which case it should be passed as a tuple or list.
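A minimal sketch (the sample tokens are an assumption; the exact shape of each yielded item is not verified here, only that every match comes with its surrounding context):
from pynlpl.textprocessors import find_keyword_in_context

tokens = ['to', 'be', 'or', 'not', 'to', 'be']
for match in find_keyword_in_context(tokens, 'not', contextsize=1):
    print(match)  # the keyword 'not' with one word of left and right context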
- pynlpl.textprocessors.is_end_of_sentence(tokens, i)¶
- pynlpl.textprocessors.split_sentences(tokens)¶
  Split sentences (based on tokenised data); returns the sentences as a list of lists, each sentence being a list of tokens.
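A minimal sketch combining it with tokenize() (the sample text is an assumption; the output in the comments is illustrative):
from pynlpl.textprocessors import tokenize, split_sentences

tokens = tokenize("This is one sentence. This is another.")
for sentence in split_sentences(tokens):
    print(sentence)
# ['This', 'is', 'one', 'sentence', '.']
# ['This', 'is', 'another', '.']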
- pynlpl.textprocessors.strip_accents(s, encoding='utf-8')¶
  Strip characters with diacritics and return a flat ASCII representation.
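A minimal sketch (the sample string is an assumption, and the exact return type, str versus bytes, may depend on the Python version, so the output in the comment is illustrative):
from pynlpl.textprocessors import strip_accents

print(strip_accents("café naïve"))  # expected: cafe naive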
- pynlpl.textprocessors.swap(tokens, maxdist=2)¶
  Perform a swap operation on a sequence of tokens, exhaustively swapping all tokens up to the maximum specified distance. This is a subset of all permutations.
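A minimal sketch (the sample tokens are an assumption; the comment lists the adjacent swaps one would expect at distance 1, not verified output):
from pynlpl.textprocessors import swap

for variant in swap(['a', 'b', 'c'], maxdist=1):
    print(variant)
# expected variants with adjacent tokens swapped:
# ['b', 'a', 'c']
# ['a', 'c', 'b']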
- pynlpl.textprocessors.tokenise(text, regexps=(re.compile('^(?:(?:https?):(?:(?://)|(?:\\\\))|www\.)(?:[\w\d:#@%/;$()~_?\+-=\\\.&](?:#!)?)*'), re.compile('^[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+(?:\.[a-zA-Z]+)+')))¶
  British-spelling alias for tokenize().
- pynlpl.textprocessors.tokenize(text, regexps=(re.compile('^(?:(?:https?):(?:(?://)|(?:\\\\))|www\.)(?:[\w\d:#@%/;$()~_?\+-=\\\.&](?:#!)?)*'), re.compile('^[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+(?:\.[a-zA-Z]+)+')))¶
  Tokenizes a string and returns a list of tokens.
Parameters:
- text (string) – The text to tokenise
- regexps (tuple/list of regular expressions) – Regular expressions to use as tokeniser rules (default: pynlpl.textprocessors.TOKENIZERRULES)
Return type: list of tokens
Examples:
>>> for token in tokenize("This is a test."):
...     print(token)
This
is
a
test
.