Solving String Similarity in SQL

Steps to solve string similarity in SQL.

Posted in Code on June 8, 2017  – 2 min read

Problem

The main focus today was to identify the similarity between two strings and derive the relevant records from a query.

To better illustrate this, let’s look at a sample table:

id                     string
mjZThqXnp7             QB1934
odGf0DZuUR             JU9910
jemvgDUw6T             UJ991
xkVGZbpG05             DF987
MtaNYMMgkV             FL9111
jqZCZI0Fdn             AM9114
xRNLSPEdAY             AM914
yMDC1QNNMc             P1N9
YAmabwCWla             JU990
Tsoc1lqG6I             Q1934B

If we are using JU991 as input, the expected output would be

id                     string
odGf0DZuUR             JU9910
jemvgDUw6T             UJ991
YAmabwCWla             JU990

When you look at the results closely, you will identify the changes needed to change the input string to match the derived strings:

  1. Delete character(s) from the beginning/end
  2. Add characters(s) from the beginning/end
  3. Switch position for two characters

So, how do we create conditions that can catch the above changes?

Implementation: Condition 1 and 2

For conditions 1 and 2, we can do so by applying the formula:

abs(len(input) - len(record[i]))

to get the absolute value as ‘difference’ between two strings, and then

isSubstringOf(input, record[i])

to determine whether the record is a substring of the input.

Implementation: Condition 3

This is a bit more tricky, because there exist a pretty straightforward solution:

sort(input) == sort(record[i])

However, I was not allowed to do so due to language constraints and there was no available sorting function. Gladly, a hard-coded solution came up:

sum(
    positionOf('a', record[i]) == positionOf('a', input)
    positionOf('b', record[i]) == positionOf('b', input)
    positionOf('c', record[i]) == positionOf('c', input)
    ...
    positionOf('A', record[i]) == positionOf('A', input)
    positionOf('B', record[i]) == positionOf('B', input)
    positionOf('C', record[i]) == positionOf('C', input)
    ...
    positionOf('0', record[i]) == positionOf('0', input)
    positionOf('1', record[i]) == positionOf('1', input)
    positionOf('2', record[i]) == positionOf('2', input)
    ...
)

It is to count the sum of existing alphanumeric characters of the input and record. The same count means they are permutations of each other.

Hope this post can help those out there solving similarity problems!