Skip to main content

Fuzzy String Matching in SQL: Using Levenshtein Similarity in CData Virtuality

  • July 18, 2025
  • 0 replies
  • 27 views

Marijan
Forum|alt.badge.img+1

Use case

Compare strings that look similar but aren’t exactly the same — such as customer names, addresses, or product titles — in SQL queries.

We are happy to announce that the development of built in functions to calculate string similarity using different algorithms is under way with the expected release by the end of summer 2025. In the meantime, here is an example on how to achieve this by creating your own virtual procedure.

 

Why Fuzzy Matching?

When integrating data from multiple sources or cleaning up inconsistent records, exact string matching (‘=’ or ‘LIKE’) often falls short. For example:

  • "John Smith" vs. "Jon Smith"

  • "Mary Johnson" vs. "Mary Jonson"

Despite being nearly identical, these will not match using standard SQL operators.

This is where Levenshtein similarity becomes powerful — it quantifies how "close" two strings are based on the minimum number of edits (insertions, deletions, substitutions) needed to convert one into the other.

 

Solution: Levenshtein Similarity as a Virtual Procedure

You can embed a Levenshtein algorithm in JavaScript and use it inside a CData Virtuality Virtual Procedure to return a similarity score between 0 and 1.

 

Procedure Definition

 

CREATE OR REPLACE VIRTUAL PROCEDURE views.similarity_levenshtein (

  IN string1 string,

  IN string2 string

)

RETURNS (

  string1 string,

  string2 string,

  similarity double

)

AS

BEGIN

  SELECT string1, string2, similarity

    FROM OBJECTTABLE(

      LANGUAGE 'javascript'

        'function levenshtein(a, b) {

           if (a === b) return 1.0;

           if (!a || !b) return 0.0;

           var m = a.length;

           var n = b.length;

           var dp = [];

           // Initialize the dp array

           for (var i = 0; i <= m; i++) {

             dp[i] = [];

             dp[i][0] = i;

           }

           for (var j = 1; j <= n; j++) {

             dp[0][j] = j;

           }

           // Fill the dp table

           for (var i = 1; i <= m; i++) {

             for (var j = 1; j <= n; j++) {

               if (a.charAt(i-1) === b.charAt(j-1)) {

                 dp[i][j] = dp[i-1][j-1];

               } else {

                 dp[i][j] = Math.min(

                   dp[i-1][j] + 1, // deletion

                   dp[i][j-1] + 1, // insertion

                   dp[i-1][j-1] + 1 // substitution

                 );

               }

             }

           }

           var distance = dp[m][n];

           var maxLen = Math.max(m, n);

           var similarity = maxLen === 0 ? 1.0 : 1.0 - (distance / maxLen);

           return similarity;

       }

       // Execute the function

       var result = levenshtein(string1, string2);

       // Set the dv_row for the output

       dv_row = {

         string1: string1,

         string2: string2,

         similarity: result

       };'

       PASSING string1 AS string1,

               string2 AS string2

       COLUMNS

         string1 string 'dv_row.string1',

         string2 string 'dv_row.string2',

         similarity double 'dv_row.similarity'

     ) AS a;

END;;

 

How to use it in SQL joins

Let’s look at two examples for joining records based on string similarity:

 

Example 1: Joining Two Customer Tables

WITH

  customers_a AS (

    SELECT 1 as id, 'John Smith' as name

    UNION ALL SELECT 2, 'Mary Johnson'

  ),

  customers_b AS (

    SELECT 101 as id, 'Jon Smith' as name

    UNION ALL SELECT 102, 'Mary Jonson'

  )

  SELECT

    a.id as id_a,

    a.name as name_a,

    b.id as id_b,

    b.name as name_b,

    sim.similarity

  FROM

    customers_a a, customers_b b

    , LATERAL(

      SELECT similarity FROM views.similarity_levenshtein(a.name, b.name)

    ) sim

  WHERE

    sim.similarity > 0.8;

Result: Returns likely matches such as "John Smith" and "Jon Smith" with a similarity score over 0.8.

 

Example 2: Using a Scalar Subquery (Alternative Syntax)

WITH

  customers_a AS (

    SELECT

      1 AS id,

      'John Smith' AS NAME

     UNION ALL

     SELECT

       2,

       'Mary Johnson'

  ),

  customers_b AS (

    SELECT

      101 AS id,

      'Jon Smith' AS NAME

    UNION ALL

    SELECT

      102,

      'Mary Jonson'

     )

  SELECT

    a.id AS id_a,

    a.name AS name_a,

    b.id AS id_b,

    b.name AS name_b

  FROM

    customers_a a,

    customers_b b

  WHERE

    (

     SELECT

       similarity

      FROM

  VIEWS.similarity_levenshtein (a.name, b.name)

  ) > 0.8;

 Note: This version works fine but doesn't expose the similarity value in the output — use lateral joins for better traceability.

 

When to Use This

Use Levenshtein similarity when:

  • Matching misspelled customer names, product names, addresses

  • Detecting duplicates across systems

  • Preprocessing fuzzy joins for human validation

  • Scoring "likelihood of match" for machine learning pipelines

Tips for Production Use

  • For performance, cache similarity results in intermediate tables if comparing large datasets.

  • Consider using a threshold like 0.85+ for high confidence matches.

  • Pair with soundex/metaphone phonetic algorithms for better results across dialects.

 

This topic has been closed for replies.