I am trying to write a SQL query where I check string values in a column and if the column contains names that are in the name_list that have a hyphen before or after the target name, I want to flag these as -1. (ie. "John-Brown" or "Brown-John" would be flagged as -1). If the name exists at all, but is not hyphenated, then we will return a 1 flag (ie. "Johnny Appleseed" would be flagged as 1) and lastly if the name is not in the column, then we will return a 0.
WITH name_list AS (
SELECT STRING_AGG(name, '|') AS name_pattern
FROM UNNEST(['John', 'David', 'Alice', 'Michael']) AS name
)
SELECT
t.id,
t.column_to_check,
CASE
-- Case 1: Word is preceded or followed by a hyphen → return -1
WHEN REGEXP_CONTAINS(t.column_to_check, r'[-](' || nl.name_pattern || r')[-]') THEN -1
-- Case 2: Word exists anywhere (unless hyphenated) → return 1
WHEN REGEXP_CONTAINS(t.column_to_check, nl.name_pattern) THEN 1
ELSE 0
END AS flag
FROM table_name t
JOIN name_list nl ON TRUE;
This should work, however, with the real data that I am working with, I have 95 different names that I am checking and millions of rows of data, and I am getting the error:
Cannot parse regular expression: pattern too large - compile failed.
Does anyone know of another, efficient way to do this?