Text indexing in python - mapping text to values using finite-state transducers
In the previous posts I wrote about the finite-state automata (FSA) - data structures that could store a set of words. Sets are quite useful, but not nearly as useful as dictionaries. Luckily for us we can quite easily extend the implementation of FSA and allow it to also store the mapping from the words to values. Such data structures are called finite-state transducers (FST).
In the post I will focus on storing the natural numbers as they can be in turn used as an index to the array which can store any type of value, but storing e.g. strings is also a viable option. The algorithm presented below was published by Mihov and Maurel and adapted to store integers. If you haven’t read the previous post I highly encourage it as this post will be building upon the foundations laid out there.
The base idea is to take the finite-state automaton and add a single value to each transition (edge), and a list of values to a state (node). The value (or values) the word maps to can be retrieved by following the transitions indexed by the consecutive letters of the word and summing up values on the edges along the way, and then adding them to the values in the final node. Let’s take a look at the example:
What’s the value of the “cumber” word? It’s value on the
s0->s1 edge (
3) plus value on the node
2) which equals to
5. “cumberer” value is
s0->s1 edge value (
3) plus final node value (
0) which equals to
3. “shrove” has two values stored -
7. The value of the word “sepsis” is
2 and the value of the word “serer” is
One important thing to note now is that there seems to be many possibilities in which you could put e.g the value
3 on the “cumber” path. In this case the result would be the same if it was put on the
s1->s2 edge or
s2->s3 edge and so on. Why did we decide to put it on the
s0->s1 edge? We want to put the values as close to the starting node as possible. If we didn’t do this we could create a situation where we cannot minimize the suffixes because of the values stored there. This allows us to build the smallest automaton.
Hopefully this gives and idea how the vales can be stored using the automaton. Now let’s see some code.
Updated State code
First let’s take a look at the
State subclass code:
class MultiValueState(State): def __init__(self, label): super().__init__(label) self.value = set() def add_transition_to_a_new_state(self, c, label, value=0): s = self.new(label) self.add_transition(c, (s, value)) return s def add_suffix(self, word, labels): state = self for c in word: state = state.add_transition_to_a_new_state(c, next(labels)) state.final = True return state def next_state(self, c): x = self.next_state_with_value(c) if x is not None: return x else: return None def next_state_with_value(self, c): return self.transitions.get(c, None) def set_state_value(self, value): self.value = value def get_word_value(self, w): s = self t_v = 0 for c in w: s, v = s.next_state_with_value(c) t_v += v if s is None: return  if s.final: return [d + t_v for d in s.value] else: return 
The changes mostly have to do with storing pairs of (
int) instead of just
State in the transition dictionary. Exception is
get_word_value method - it gets the values encoded in the automation for a given word
Equivalency of the FST nodes
If you remember from the previous post an important part of the algorithm building the FSA was replacing the new nodes with their equivalents already present in the automaton to get the minimal possible version of it. It’s the same in the case of FST. Below you’ll find an updated
replace_and_register method which aim is to minimize the last added word. The change from before is the need to preserve the value stored on the edge when you’re pointing it to an equivalent node. The first and last line of the method were modified to accomplish that:
def replace_or_register(state, register): last_c, (last_child, value) = state.last_transition() if last_child.transitions: replace_or_register(last_child, register) eq_state = register.get(last_child) if eq_state is None: register.put(last_child) else: state.add_transition(last_c, (eq_state, value))
Next we need to update our definition of equivalency as our definition of states changed. Now they are storing some values. Before we stated the states are equivalent if:
- they are both final or both non-final,
- their transitions have the same labels and lead to equivalent states.
In addition to the points above we’ll add the following conditions:
- the values stored on the states are the same,
- the transitions have the same values stored on them.
To accommodate for these changes we need to update our definition of key method, which is used in the state register to provide hashable key for a given state.
def key_obj_multi(state): l = [state.final] if state.final: l += sorted(state.value) for k, (s, v) in sorted(state.transitions.items(), key=lambda x: x): l.extend([k, id(s), v]) return tuple(l)
It’s now the time to take a look at the algorithm which builds the automaton. I put some comments inside which hopefully will be enough to explain the reasoning behind the code.
def minimal_fst_from_words(word_value_pairs): # setting up the register with the new 'key' method register = Register(key_obj_multi) labels = inf_labels() start_state = MultiValueState(next(labels)) prev_word = None for word, word_value in word_value_pairs: if prev_word is not None: assert prev_word <= word # here we do exactly the same thing we did when building FSA last_state, pref_idx = common_prefix(start_state, word) if last_state.transitions: replace_or_register(last_state, register) last_added = last_state.add_suffix(word[pref_idx:], labels) # here starts the part responsible for handling the values prev_state = start_state stored_value = word_value # here we manage the values on the common part of the last added word # and currently added word # we try to update the values so they're correct with the new word present # and to keep them as close to the start state as possible for j in range(pref_idx): current_state, current_state_value = \ prev_state.next_state_with_value(word[j]) # we take the min of currently stored value and the value we want # to store on the path, and we store it in the current transition common_value = min(current_state_value, stored_value) prev_state.add_transition(word[j], (current_state, common_value)) # we compute the remainder and push it downstream to all the # transitions push_next_value = current_state_value - common_value for c, (s, v) in list(current_state.transitions.items()): current_state.add_transition(c, (s, push_next_value + v)) # if we're in the final state we store the remainder value on it if current_state.final: current_state.set_state_value( set([d + push_next_value for d in current_state.value]) ) # we compute the value we still need to store on our path in # the downstram transitions stored_value = stored_value - common_value prev_state = current_state #assert current_state should be just before last state last_idx = pref_idx if prev_word == word: # if we have word repetition just store extra value on the last state last_added.value.add(stored_value) elif prev_state.transitions: # otheriwse store the remainder just after the common prefix current_state, current_state_value = \ prev_state.next_state_with_value(word[last_idx]) prev_state.add_transition( word[last_idx], (current_state, stored_value) ) prev_word = word replace_or_register(start_state, register) return start_state, register
I hope you enjoyed reading the article. If you would like to contact me regarding the post drop me an email at: tostr.pl [at] gmail.com