Code

Solving String Similarity

Problem

The main focus today was to identify the similarity between two strings and derive the relevant records from a query.

To better illustrate this, let’s look at a sample table:

1
2
3
4
5
6
7
8
9
10
11
id                     string
mjZThqXnp7 QB1934
odGf0DZuUR JU9910
jemvgDUw6T UJ991
xkVGZbpG05 DF987
MtaNYMMgkV LF9111
jqZCZI0Fdn MA9114
xRNLSPEdAY MA914
yMDC1QNNMc P1N9
YAmabwCWla JU990
Tsoc1lqG6I Q1934B

If we are using JU991 as input, the expected output would be

1
2
3
4
id                     string
odGf0DZuUR JU9910
jemvgDUw6T UJ991
YAmabwCWla JU990

When you look at the results closely, you will identify the changes needed to change the input string to match the derived strings:

  1. Delete character(s) from the beginning/end
  2. Add characters(s) from the beginning/end
  3. Switch position for two characters

So, how do we create conditions that can catch the above changes?

Implementation: Condition 1 and 2

For conditions 1 and 2, we can do so by applying the formula:

1
abs(len(input) - len(record[i]))

to get the absolute value as ‘difference’ between two strings,
and then

1
isSubstringOf(input, record[i])

to determine whether the record is a substring of the input.

Implementation: Condition 3

This is a bit more tricky, because there exist a pretty straightforward solution:

1
sort(input) == sort(record[i])

However, I was not allowed to do so due to language constraints and there was no available sorting function.
Gladly, a hard-coded solution came up:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
sum(
positionOf('a', record[i]) == positionOf('a', input)
positionOf('b', record[i]) == positionOf('b', input)
positionOf('c', record[i]) == positionOf('c', input)
...
positionOf('A', record[i]) == positionOf('A', input)
positionOf('B', record[i]) == positionOf('B', input)
positionOf('C', record[i]) == positionOf('C', input)
...
positionOf('0', record[i]) == positionOf('0', input)
positionOf('1', record[i]) == positionOf('1', input)
positionOf('2', record[i]) == positionOf('2', input)
...
)

It is to count the sum of existing alphanumeric characters of the input and record. The same count means they are permutations of each other.

Hope this post can help those out there solving similarity problems!

#similarity #sql #summer-2017 #uber-internship