**Edit distance recursive algorithm — Skiena** –

I’m reading The Algorithm Design Manual by Steven Skiena, and I’m on the dynamic programming chapter. He has some example code for edit distance and uses some functions which are explained neither in the book nor on the internet. So I’m wondering

a) how does this algorithm work?

b) what do the functions indel and match do?

```
#define MATCH 0 /* enumerated type symbol for match */
#define INSERT 1 /* enumerated type symbol for insert */
#define DELETE 2 /* enumerated type symbol for delete */
int string_compare(char *s, char *t, int i, int j)
{
int k; /* counter */
int opt[3]; /* cost of the three options */
int lowest_cost; /* lowest cost */
if (i == 0) return(j * indel(' '));
if (j == 0) return(i * indel(' '));
opt[MATCH] = string_compare(s,t,i-1,j-1) + match(s[i],t[j]);
opt[INSERT] = string_compare(s,t,i,j-1) + indel(t[j]);
opt[DELETE] = string_compare(s,t,i-1,j) + indel(s[i]);
lowest_cost = opt[MATCH];
for (k=INSERT; k<=DELETE; k++)
if (opt[k] < lowest_cost) lowest_cost = opt[k];
return( lowest_cost );
}
```

On page 287 in the book:

```
int match(char c, char d)
{
if (c == d) return(0);
else return(1);
}
int indel(char c)
{
return(1);
}
```

They’re explained in the book. Please read section **8.2.4 Varieties of Edit Distance**

Basically, it utilizes the dynamic programming method of solving problems where the solution to the problem is constructed to solutions to subproblems, to avoid recomputation, either bottom-up or top-down.

The recursive structure of the problem is as given here, where `i,j`

are start (or end) indices in the two strings respectively.

Here’s an excerpt from this page that explains the algorithm well.

Problem: Given two strings of size m, n and set of operations replace

(R), insert (I) and delete (D) all at equal cost. Find minimum number

of edits (operations) required to convert one string into another.Identifying Recursive Methods:

What will be sub-problem in this case? Consider finding edit distance

of part of the strings, say small prefix. Let us denote them as

[1…i] and [1…j] for some 1< i < m and 1 < j < n. Clearly it is

solving smaller instance of final problem, denote it as E(i, j). Our

goal is finding E(m, n) and minimizing the cost.In the prefix, we can right align the strings in three ways (i, -),

(-, j) and (i, j). The hyphen symbol (-) representing no character. An

example can make it more clear.Given strings SUNDAY and SATURDAY. We want to convert SUNDAY into

SATURDAY with minimum edits. Let us pick i = 2 and j = 4 i.e. prefix

strings are SUN and SATU respectively (assume the strings indices

start at 1). The right most characters can be aligned in three

different ways.Case 1: Align characters U and U. They are equal, no edit is required.

We still left with the problem of i = 1 and j = 3, E(i-1, j-1).Case 2: Align right character from first string and no character from

second string. We need a deletion (D) here. We still left with problem

of i = 1 and j = 4, E(i-1, j).Case 3: Align right character from second string and no character from

first string. We need an insertion (I) here. We still left with

problem of i = 2 and j = 3, E(i, j-1).Combining all the subproblems minimum cost of aligning prefix strings

ending at i and j given byE(i, j) = min( [E(i-1, j) + D], [E(i, j-1) + I], [E(i-1, j-1) + R if

i,j characters are not same] )We still not yet done. What will be base case(s)?

When both of the strings are of size 0, the cost is 0. When only one

of the string is zero, we need edit operations as that of non-zero

length string. Mathematically,E(0, 0) = 0, E(i, 0) = i, E(0, j) = j

I recommend going through this lecture for a good explanation.

The function `match()`

returns 1, if the two characters mismatch (so that one more move is added in the final answer) otherwise 0.

Please go through this link:

https://secweb.cs.odu.edu/~zeil/cs361/web/website/Lectures/styles/pages/editdistance.html

the code implementing the above algorithm is :

```
int dpEdit(char *s1, char *s2 ,int len1,int len2)
{
if(len1==0) /// Base Case
return len2;
else if(len2==0)
return len1;
else
{
int add, remove,replace;
int table[len1+1][len2+2];
for(int i=0;i<=len2;i++)
table[0][i]=i;
for(int i=0;i<=len1;i++)
table[i][0]=i;
for(int i=1;i<=len1;i++)
{
for(int j=1;j<=len2;j++)
{
// Add
//
add = table[i][j-1]+1;
remove = table[i-1][j]+1;
if(s1[i-1]!=s2[j-1])
replace = table[i-1][j-1]+1;
else
replace =table[i-1][j-1];
table[i][j]= min(min(add,remove),replace); // Done :)
}
}
```

This is a recursive algorithm not dynamic programming. Note that both i & j point to the last char of s & t respectively when the algorithm starts.

indel returns 1.

match(a, b) returns 0 if a = b (match) else return 1 (substitution)

```
#define MATCH 0 /* enumerated type symbol for match */
#define INSERT 1 /* enumerated type symbol for insert */
#define DELETE 2 /* enumerated type symbol for delete */
int string_compare(char *s, char *t, int i, int j)
{
int k; /* counter */
int opt[3]; /* cost of the three options */
int lowest_cost; /* lowest cost */
// base case, if i is 0, then we reached start of s and
// now it's empty, so there would be j * 1 edit distance between s & t
// think of it if s is initially empty and t is not, how many
// edits we need to perform on s to be similar to t? answer is where
// we are at t right now which is j
if (i == 0) return(j * indel(' '));
// same reasoning as above but for s instead of t
if (j == 0) return(i * indel(' '));
// calculate opt[match] by checking if s[i] = t[j] which = 0 if true or 1 if not
// then recursively do the same for s[i-1] & t[j-1]
opt[MATCH] = string_compare(s,t,i-1,j-1) + match(s[i],t[j]);
// calculate opt[insert] which is how many chars we need to insert
// in s to make it looks like t, or look at it from the other way,
// how many chars we need to delete from t to make it similar to s?
// since we're deleting from t, we decrease j by 1 and leave i (pointer
// in s) as is + indel(t[j]) which we deleted (always returns 1)
opt[INSERT] = string_compare(s,t,i,j-1) + indel(t[j]);
// same reasoning as before but deleting from s or inserting into t
opt[DELETE] = string_compare(s,t,i-1,j) + indel(s[i]);
// these lines are just to pick the min of opt[match], opt[insert], and
// opt[delete]
lowest_cost = opt[MATCH];
for (k=INSERT; k<=DELETE; k++)
if (opt[k] < lowest_cost) lowest_cost = opt[k];
return( lowest_cost );
}
```

The algorithm is not hard to understand, you just need to read it couple of times. What’s always amuse me is the person who invented it and the trust that recursion will do the right thing.

This is likely a non-issue for the OP by now, but I’ll write down my understanding of the text.

```
/**
* Returns the cost of a substitution(match) operation
*/
int match(char c, char d)
{
if (c == d) return 0
else return 1
}
/**
* Returns the cost of an insert/delete operation(assumed to be a constant operation)
*/
int indel(char c)
{
return 1
}
```

The edit distance is essentially the minimum number of modifications on a given string, required to transform it into another reference string. The modifications,as you know, can be the following.

- Substitution (Replacing a single character)
- Insert (Insert a single character into the string)
- Delete (Deleting a single character from the string)

Now,

Properly posing the question of string similarity requires us to set the cost of each of these string transform operations. Assigning each operation an equal cost of 1 defines the edit distance between two strings.

So that establishes that each of the three modifications known to us have a constant cost, O(1).

But how do we know where to modify?

We instead look for modifications that may or may not be needed from the end of the string, character by character. So,

- We count all substitution operations, starting from the end of the string
- We count all delete operations, starting from the end of the string
- We count all insert operations, starting from the end of the string

Finally, once we have this data, we return the minimum of the above three sums.