Use case
Compare strings that look similar but aren’t exactly the same — such as customer names, addresses, or product titles — in SQL queries.
We are happy to announce that built-in functions for calculating string similarity with different algorithms are under development, with an expected release by the end of summer 2025. In the meantime, here is an example of how to achieve this by creating your own virtual procedure.
Why Fuzzy Matching?
When integrating data from multiple sources or cleaning up inconsistent records, exact string matching (‘=’ or ‘LIKE’) often falls short. For example:
- "John Smith" vs. "Jon Smith"
- "Mary Johnson" vs. "Mary Jonson"
Despite being nearly identical, these will not match using standard SQL operators.
This is where Levenshtein similarity becomes powerful — it quantifies how "close" two strings are based on the minimum number of edits (insertions, deletions, substitutions) needed to convert one into the other.
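For example, turning "John Smith" into "Jon Smith" takes a single deletion (the "h" in "John"), so the Levenshtein distance is 1. Normalizing by the length of the longer string (10 characters) gives a similarity of 1 - 1/10 = 0.9; this is exactly how the procedure below converts an edit distance into a score between 0 and 1.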
Solution: Levenshtein Similarity as a Virtual Procedure
You can embed a Levenshtein algorithm in JavaScript and use it inside a CData Virtuality Virtual Procedure to return a similarity score between 0 and 1.
Procedure Definition
CREATE OR REPLACE VIRTUAL PROCEDURE views.similarity_levenshtein (
IN string1 string,
IN string2 string
)
RETURNS (
string1 string,
string2 string,
similarity double
)
AS
BEGIN
SELECT string1, string2, similarity
FROM OBJECTTABLE(
LANGUAGE 'javascript'
'function levenshtein(a, b) {
if (a === b) return 1.0;
if (!a || !b) return 0.0;
var m = a.length;
var n = b.length;
var dp = [];
// Initialize the dp array
for (var i = 0; i <= m; i++) {
dp[i] = [];
dp[i][0] = i;
}
for (var j = 1; j <= n; j++) {
dp[0][j] = j;
}
// Fill the dp table
for (var i = 1; i <= m; i++) {
for (var j = 1; j <= n; j++) {
if (a.charAt(i-1) === b.charAt(j-1)) {
dp[i][j] = dp[i-1][j-1];
} else {
dp[i][j] = Math.min(
dp[i-1][j] + 1, // deletion
dp[i][j-1] + 1, // insertion
dp[i-1][j-1] + 1 // substitution
);
}
}
}
var distance = dp[m][n];
var maxLen = Math.max(m, n);
var similarity = maxLen === 0 ? 1.0 : 1.0 - (distance / maxLen);
return similarity;
}
// Execute the function
var result = levenshtein(string1, string2);
// Set the dv_row for the output
dv_row = {
string1: string1,
string2: string2,
similarity: result
};'
PASSING string1 AS string1,
string2 AS string2
COLUMNS
string1 string 'dv_row.string1',
string2 string 'dv_row.string2',
similarity double 'dv_row.similarity'
) AS a;
END;;
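Once the procedure is created, a quick standalone call is a useful sanity check (this assumes the views schema used in the definition above; adjust if you deployed it elsewhere):

SELECT * FROM views.similarity_levenshtein('John Smith', 'Jon Smith');
-- expected: one row with similarity = 0.9 (one edit over a 10-character string)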
How to Use It in SQL Joins
Let’s look at two examples for joining records based on string similarity:
Example 1: Joining Two Customer Tables
WITH
customers_a AS (
SELECT 1 as id, 'John Smith' as name
UNION ALL SELECT 2, 'Mary Johnson'
),
customers_b AS (
SELECT 101 as id, 'Jon Smith' as name
UNION ALL SELECT 102, 'Mary Jonson'
)
SELECT
a.id as id_a,
a.name as name_a,
b.id as id_b,
b.name as name_b,
sim.similarity
FROM
customers_a a, customers_b b
, LATERAL(
SELECT similarity FROM views.similarity_levenshtein(a.name, b.name)
) sim
WHERE
sim.similarity > 0.8;
Result: Returns likely matches such as "John Smith" and "Jon Smith" with a similarity score over 0.8.
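For reference, the two surviving pairs score 1 - 1/10 = 0.9 ("John Smith" / "Jon Smith") and 1 - 1/12 ≈ 0.92 ("Mary Johnson" / "Mary Jonson"), while the cross pairs fall well below the 0.8 threshold and are filtered out.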
Example 2: Using a Scalar Subquery (Alternative Syntax)
WITH
customers_a AS (
SELECT
1 AS id,
'John Smith' AS name
UNION ALL
SELECT
2,
'Mary Johnson'
),
customers_b AS (
SELECT
101 AS id,
'Jon Smith' AS name
UNION ALL
SELECT
102,
'Mary Jonson'
)
SELECT
a.id AS id_a,
a.name AS name_a,
b.id AS id_b,
b.name AS name_b
FROM
customers_a a,
customers_b b
WHERE
(
SELECT
similarity
FROM
views.similarity_levenshtein(a.name, b.name)
) > 0.8;
Note: This version works fine but doesn't expose the similarity value in the output — use lateral joins for better traceability.
When to Use This
Use Levenshtein similarity when:
- Matching misspelled customer names, product names, addresses
- Detecting duplicates across systems
- Preprocessing fuzzy joins for human validation
- Scoring "likelihood of match" for machine learning pipelines
Tips for Production Use
- For performance, cache similarity results in intermediate tables if comparing large datasets (see the sketch below).
- Consider using a threshold like 0.85+ for high-confidence matches.
- Pair with soundex/metaphone phonetic algorithms for better results across dialects.
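Below is a minimal sketch combining the first two tips. It assumes a writable analytical-storage schema (named dwh here purely for illustration) and that customers_a and customers_b exist as persistent tables rather than the inline CTEs used above; adjust schema and table names to your environment.

-- Persist pairwise scores once so downstream queries can reuse them
-- instead of re-running the JavaScript procedure for every row pair.
SELECT
a.id AS id_a,
b.id AS id_b,
sim.similarity
INTO dwh.customer_match_scores
FROM
customers_a a, customers_b b
, LATERAL(
SELECT similarity FROM views.similarity_levenshtein(a.name, b.name)
) sim
WHERE
sim.similarity > 0.85;

Downstream queries can then join against dwh.customer_match_scores directly rather than recomputing similarities on the fly.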

