Statistics and Information Theory
This module contains classes and functions for statistics and information theory. It is imported as follows:
import pynlpl.statistics
Generic functions
Amongst others, the following generic statistical functions are available:
* ``mean(list)`` - Computes the mean of a given list of numbers
* ``median(list)`` - Computes the median of a given list of numbers
* ``stddev(list)`` - Computes the standard deviation of a given list of numbers
* ``normalize(list)`` - Normalizes a list of numbers so that the sum is 1.0
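As a rough illustration of what these functions compute, here is a minimal plain-Python sketch. The helper names (`simple_mean`, `simple_median`, `simple_normalize`) are hypothetical stand-ins written for this example and are not the library's implementation; the standard deviation is left out because this page does not say whether the sample or population form is used.

```python
def simple_mean(values):
    """Arithmetic average of a non-empty list of numbers (hypothetical helper)."""
    return sum(values) / len(values)


def simple_median(values):
    """Middle value of the sorted list; averages the two middle values
    when the list has an even number of elements (hypothetical helper)."""
    ordered = sorted(values)
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


def simple_normalize(numbers, total=1.0):
    """Scale the numbers by a constant so that they sum to `total`
    (hypothetical helper)."""
    factor = total / sum(numbers)
    return [n * factor for n in numbers]


print(simple_mean([1, 2, 3]))
print(simple_median([1, 2, 3, 4]))
print(simple_normalize([1, 2, 1]))
```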
Frequency Lists and Distributions
One of the most basic and widespread tasks in NLP is the creation of a frequency list. Counting is done by simply appending lists of tokens to the frequency list:
freqlist = pynlpl.statistics.FrequencyList()
freqlist.append(['to','be','or','not','to','be'])
Take care to append lists rather than strings, unless you mean to create a frequency list over characters rather than words. You may want to use pynlpl.textprocessors.crude_tokeniser first:
freqlist.append(pynlpl.textprocessors.crude_tokeniser("to be or not to be"))
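The difference between counting a list of tokens and counting a bare string can be seen with the standard library's `collections.Counter`, used here only as an analogy. This sketch does not call pynlpl, and the variable names are made up for the example.

```python
from collections import Counter

text = "to be or not to be"

# Updating with a list of tokens counts whole words.
word_counts = Counter()
word_counts.update(text.split())

# Updating with a bare string iterates over it character by character,
# so the result is a frequency list over characters (spaces included).
char_counts = Counter()
char_counts.update(text)

print(word_counts["be"])   # count of the word 'be'
print(char_counts["o"])    # count of the letter 'o'
```

If the word-level counts are what you want, tokenise first and pass the resulting list.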
The count can also be incremented explicitly for a single item:
freqlist.count('shakespeare')
The FrequencyList offers dictionary-like access. For example, the following statement will be true for the frequency list just created:
freqlist['be'] == 2
Normalised counts (pseudo-probabilities) can be obtained using the p() method:
freqlist.p('be')
Normalised counts can also be obtained by instantiating a Distribution using the frequency list:
dist = pynlpl.statistics.Distribution(freqlist)
This too offers a dictionary-like interface, where values are by definition normalised. The advantage of a Distribution class is that it offers information-theoretic methods such as entropy(), maxentropy(), perplexity() and poslog().
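To make these quantities concrete, the following sketch computes them by hand for the "to be or not to be" counts, using the textbook definitions (entropy as the negative sum of p log p, maximum entropy as the log of the number of types, perplexity as the base raised to the entropy). The helper functions are hypothetical and written for this example; they are not the Distribution methods themselves, and the library's handling of the logarithm base may differ.

```python
import math
from collections import Counter

counts = Counter(["to", "be", "or", "not", "to", "be"])
total = sum(counts.values())

# Normalise raw counts into probabilities that sum to 1.0.
probabilities = {word: count / total for word, count in counts.items()}


def dist_entropy(dist, base=2):
    """Shannon entropy: -sum(p * log(p)) over all types (hypothetical helper)."""
    return -sum(p * math.log(p, base) for p in dist.values() if p > 0)


def dist_max_entropy(dist, base=2):
    """Entropy of a uniform distribution over the same number of types."""
    return math.log(len(dist), base)


def dist_perplexity(dist, base=2):
    """Perplexity: the base raised to the entropy."""
    return base ** dist_entropy(dist, base)


def information_content(dist, item, base=2):
    """Information content (positive log) of one type: -log(p(item))."""
    return -math.log(dist[item], base)


print(dist_entropy(probabilities))
print(dist_max_entropy(probabilities))
print(dist_perplexity(probabilities))
print(information_content(probabilities, "be"))
```

The entropy stays below the maximum entropy here because the four types are not equally frequent.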
A frequency list can be saved to file using the save(filename) method, and loaded back from file using the load(filename) method. The output() method is a generator yielding strings for each line of output, in ranked order.
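The idea of ranked, line-oriented output that can be read back can be sketched as follows. The tab-separated "type, count" layout is an assumption made for this example and may not match the file format that save() actually writes; `ranked_lines` and `parse_lines` are hypothetical helpers.

```python
from collections import Counter


def ranked_lines(counts, delimiter="\t"):
    """Yield one 'type<delimiter>count' string per type, most frequent first."""
    for item, count in counts.most_common():
        yield item + delimiter + str(count)


def parse_lines(lines, delimiter="\t"):
    """Rebuild the counts from lines produced by ranked_lines()."""
    counts = Counter()
    for line in lines:
        item, count = line.rstrip("\n").split(delimiter)
        counts[item] += int(count)
    return counts


counts = Counter(["to", "be", "or", "not", "to", "be"])
lines = list(ranked_lines(counts))
restored = parse_lines(lines)

for line in lines:
    print(line)
```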
API Reference
This is a Python library containing classes for statistical and information-theoretic computations. It also contains some code from Peter Norvig, AI: A Modern Approach: http://aima.cs.berkeley.edu/python/utils.html
class pynlpl.statistics.Distribution(data, base=2)
    A distribution can be created over a FrequencyList or a plain dictionary with numeric values. It will be normalized automatically. This implementation uses dictionaries/hashing.
    entropy(base=2)
        Compute the entropy of the distribution
    information(type)
        Computes the information content of the specified type: -log_e(p(X))
    items()
        Returns an unranked list of (type, prob) pairs. Use this only if you are not interested in the order.
    keys()
    maxentropy(base=2)
        Compute the maximum entropy of the distribution: log_e(N)
    mode()
        Returns the type that occurs the most frequently in the probability distribution
    output(delimiter='\t', freqlist=None)
        Generator yielding formatted strings expressing the type and probability for each item in the distribution
    perplexity(base=2)
    poslog(type)
        Alias for information content
    values()
class pynlpl.statistics.FrequencyList(tokens=None, casesensitive=True, dovalidation=True)
    A frequency list (implemented using dictionaries)
    append(tokens)
        Add a list of tokens to the frequency list. This method will count them for you.
    count(type, amount=1)
        Count a certain type. The counter will increase by the amount specified (defaults to one)
    dict()
    items()
        Returns an unranked list of (type, count) pairs. Use this only if you are not interested in the order.
    load(filename)
        Load a frequency list from file (in the format produced by the save method)
    mode()
        Returns the type that occurs the most frequently in the frequency list
    output(delimiter='\t', addnormalised=False)
        Print a representation of the frequency list
    p(type)
        Returns the probability (relative frequency) of the token
    save(filename, addnormalised=False)
        Save a frequency list to file, can be loaded later using the load method
    sum()
        Returns the total number of tokens
    tokens()
        Returns the total number of tokens
    typetokenratio()
        Computes the type/token ratio
    values()
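For orientation, the type/token ratio and the mode can be worked out by hand from raw counts. This is a sketch using `collections.Counter` and the usual definition (number of distinct types divided by total number of tokens); it does not call FrequencyList, and the exact computation in the library is not shown on this page.

```python
from collections import Counter

counts = Counter(["to", "be", "or", "not", "to", "be"])

types = len(counts)              # distinct items: to, be, or, not
tokens = sum(counts.values())    # total number of counted items
type_token_ratio = types / tokens

# The most frequent type; ties are resolved by first occurrence here.
most_frequent, frequency = counts.most_common(1)[0]

print(type_token_ratio)
print(most_frequent, frequency)
```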
class pynlpl.statistics.HiddenMarkovModel(startstate, endstate=None)
    print_dptable(V)
    setemission(state, distribution)
    viterbi(observations, doprint=False)
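Since the viterbi method is listed without a description, a generic sketch of the Viterbi algorithm may help readers who are unfamiliar with it. The function below is a standalone illustration with its own hypothetical signature (plain dictionaries for start, transition and emission probabilities); it is not the HiddenMarkovModel implementation, and the toy numbers are invented for the example.

```python
def viterbi_sketch(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) for the most likely hidden state sequence."""
    # best[t][s] = (probability of the best path ending in state s at time t,
    #               the state that precedes s on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Pick the predecessor r that maximises the path probability.
            column[s] = max(
                (best[-1][r][0] * trans_p[r][s] * emit_p[s][obs], r)
                for r in states
            )
        best.append(column)

    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    probability = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return probability, path


# Toy model with made-up probabilities.
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
emit_p = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}

probability, path = viterbi_sketch(
    ["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p
)
print(path, probability)
```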
class pynlpl.statistics.MarkovChain(startstate, endstate=None)
    accessible(fromstate, tostate)
        Is state tostate directly accessible (in one step) from state fromstate, i.e. is there an edge between the nodes? If so, return the probability, else zero
    communicates(fromstate, tostate, maxlength=999999)
        See if a node communicates (directly or indirectly) with another. Returns the probability of the shortest path (probably, but not necessarily, the path with the highest probability)
    p(sequence, subsequence=True)
        Returns the probability of the given sequence or subsequence (if subsequence=True, default).
    reducible()
    settransitions(state, distribution)
    size()
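The probability of a state sequence in a first-order Markov chain is the product of its one-step transition probabilities. The sketch below illustrates that idea with a plain dictionary; the helper names and the toy transition table are hypothetical, and how MarkovChain itself treats the start state or the subsequence flag is not shown on this page.

```python
# Toy transition table with made-up probabilities: state -> {next state: probability}.
transitions = {
    "start": {"sunny": 0.6, "rainy": 0.4},
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}


def one_step_probability(fromstate, tostate):
    """Probability of moving directly from one state to another, or 0.0
    if there is no edge between them (hypothetical helper)."""
    return transitions.get(fromstate, {}).get(tostate, 0.0)


def sequence_probability(sequence):
    """Multiply the one-step probabilities along consecutive pairs of states."""
    probability = 1.0
    for current_state, next_state in zip(sequence, sequence[1:]):
        probability *= one_step_probability(current_state, next_state)
    return probability


print(sequence_probability(["start", "sunny", "sunny", "rainy"]))
print(one_step_probability("sunny", "start"))
```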
pynlpl.statistics.dotproduct(X, Y)
    Return the sum of the element-wise product of vectors X and Y.
    >>> dotproduct([1, 2, 3], [1000, 100, 10])
    1230
pynlpl.statistics.histogram(values, mode=0, bin_function=None)
    Return a list of (value, count) pairs, summarizing the input values. Sorted by increasing value, or if mode=1, by decreasing count. If bin_function is given, map it over values first.
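The behaviour described for histogram can be approximated in a few lines. This is a sketch following the description above, not the library source; `value_histogram` and its `by_count` flag are hypothetical names standing in for the `mode` argument.

```python
from collections import Counter


def value_histogram(values, by_count=False, bin_function=None):
    """Return (value, count) pairs, sorted by increasing value, or by
    decreasing count when by_count is True (hypothetical helper)."""
    if bin_function is not None:
        # Map every value to its bin before counting.
        values = map(bin_function, values)
    counts = Counter(values)
    if by_count:
        return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    return sorted(counts.items())


print(value_histogram([3, 1, 3, 2, 3, 1]))
print(value_histogram([3, 1, 3, 2, 3, 1], by_count=True))
print(value_histogram([5, 12, 17, 23], bin_function=lambda x: x // 10))
```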
pynlpl.statistics.levenshtein(s1, s2, maxdistance=9999)
    Computes the Levenshtein distance between two strings. Adapted from: http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python
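For readers who want to see the underlying computation, here is a minimal sketch of the standard dynamic-programming Levenshtein distance. The function name `edit_distance` is hypothetical, and the `maxdistance` cut-off is left out because its exact semantics are not described here.

```python
def edit_distance(s1, s2):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn s1 into s2."""
    if len(s1) < len(s2):
        # Keep the shorter string in the inner loop to use less memory.
        s1, s2 = s2, s1
    # previous[j] holds the distance between the processed prefix of s1
    # and the first j characters of s2.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            insertion = previous[j] + 1
            deletion = current[j - 1] + 1
            substitution = previous[j - 1] + (c1 != c2)
            current.append(min(insertion, deletion, substitution))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))
```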
pynlpl.statistics.log2(x)
    Base 2 logarithm.
    >>> log2(1024)
    10.0
pynlpl.statistics.mean(values)
    Return the arithmetic average of the values.
pynlpl.statistics.median(values)
    Return the middle value, when the values are sorted. If there is an even number of elements, try to average the middle two. If they can't be averaged (e.g. they are strings), choose one at random.
    >>> median([10, 100, 11])
    11
    >>> median([1, 2, 3, 4])
    2.5
pynlpl.statistics.mode(values)
    Return the most common value in the list of values.
    >>> mode([1, 2, 3, 2])
    2
pynlpl.statistics.normalize(numbers, total=1.0)
    Multiply each number by a constant such that the sum is 1.0 (or total).
    >>> normalize([1,2,1])
    [0.25, 0.5, 0.25]
pynlpl.statistics.product(seq)
    Return the product of a sequence of numerical values.
    >>> product([1,2,6])
    12
pynlpl.statistics.stddev(values, meanval=None)
    The standard deviation of a set of values. Pass in the mean if you already know it.
pynlpl.statistics.vector_add(a, b)
    Component-wise addition of two vectors.
    >>> vector_add((0, 1), (8, 9))
    (8, 10)