[BUG] `InputColumn` does not work properly with Spark columns that need escaping #1712

RobinL · 2023-11-08T15:14:21Z

Reprex:

This works:

from splink.input_column import InputColumn
InputColumn("first_name", sql_dialect="spark").name_l()

This does not (note there's a space in first name in the below, but an underscore above:

from splink.input_column import InputColumn
InputColumn("first name", sql_dialect="spark").l_name_as_l()

Click to expand full traceback

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
[~/Documents/data_linking/splink_311/try_spark.py](https://file+.vscode-resource.vscode-cdn.net/Users/robinlinacre/Documents/data_linking/splink_311/~/Documents/data_linking/splink_311/try_spark.py) in <cell line: 0>()
----> [16](file:///Users/robinlinacre/Documents/data_linking/splink_311/try_spark.py?line=15) InputColumn("first name", sql_dialect="spark").name_l()

[~/Documents/data_linking/splink_311/splink/input_column.py](https://file+.vscode-resource.vscode-cdn.net/Users/robinlinacre/Documents/data_linking/splink_311/~/Documents/data_linking/splink_311/splink/input_column.py) in name_l(self)
    143 
    144     def name_l(self):
--> 145         return add_suffix(self.input_name_as_tree, suffix="_l").sql(
    146             dialect=self._sql_dialect
    147         )

[~/Documents/data_linking/splink_311/splink/input_column.py](https://file+.vscode-resource.vscode-cdn.net/Users/robinlinacre/Documents/data_linking/splink_311/~/Documents/data_linking/splink_311/splink/input_column.py) in add_suffix(tree, suffix)
     24 def add_suffix(tree, suffix):
     25     tree = tree.copy()
---> 26     identifier_string = tree.find(exp.Identifier).this
     27     identifier_string = f"{identifier_string}{suffix}"
     28     tree.find(exp.Identifier).args["this"] = identifier_string

AttributeError: 'NoneType' object has no attribute 'this'

And hence a full example script with a first name column doesn't work:

click to expand script


from pyspark.context import SparkConf, SparkContext
from pyspark.sql import SparkSession

from splink.datasets import splink_datasets
from splink.spark.comparison_library import exact_match
from splink.spark.linker import SparkLinker

conf = SparkConf()

conf.set("spark.driver.memory", "12g")
conf.set("spark.sql.shuffle.partitions", "12")

sc = SparkContext.getOrCreate(conf=conf)
sc.setCheckpointDir("tmp_checkpoints/")
spark = SparkSession(sc)
df_pandas = splink_datasets.fake_1000
df = spark.createDataFrame(df_pandas)

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.surname = r.surname",
    ],
    "comparisons": [
        exact_match("first name"),
        exact_match("city", term_frequency_adjustments=True),
    ],
}


linker = SparkLinker(df, settings)
linker.estimate_u_using_random_sampling(target_rows=1e2)

full traceback

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
[~/Documents/data_linking/splink_311/try_spark.py](https://file+.vscode-resource.vscode-cdn.net/Users/robinlinacre/Documents/data_linking/splink_311/~/Documents/data_linking/splink_311/try_spark.py) in <cell line: 0>()
     [19](file:///Users/robinlinacre/Documents/data_linking/splink_311/try_spark.py?line=18)     ],
     [20](file:///Users/robinlinacre/Documents/data_linking/splink_311/try_spark.py?line=19)     "comparisons": [
---> [21](file:///Users/robinlinacre/Documents/data_linking/splink_311/try_spark.py?line=20)         exact_match("first name"),
     [22](file:///Users/robinlinacre/Documents/data_linking/splink_311/try_spark.py?line=21)         exact_match("city", term_frequency_adjustments=True),
     [23](file:///Users/robinlinacre/Documents/data_linking/splink_311/try_spark.py?line=22)     ],

[~/Documents/data_linking/splink_311/splink/comparison_library.py](https://file+.vscode-resource.vscode-cdn.net/Users/robinlinacre/Documents/data_linking/splink_311/~/Documents/data_linking/splink_311/splink/comparison_library.py) in __init__(self, col_name, regex_extract, valid_string_pattern, set_to_lowercase, term_frequency_adjustments, m_probability_exact_match, m_probability_else, include_colname_in_charts_label)
    101             "comparison_description": "Exact match vs. anything else",
    102             "comparison_levels": [
--> 103                 self._null_level(col_name, valid_string_pattern),
    104                 self._exact_match_level(
    105                     col_name,

[~/Documents/data_linking/splink_311/splink/comparison_level_library.py](https://file+.vscode-resource.vscode-cdn.net/Users/robinlinacre/Documents/data_linking/splink_311/~/Documents/data_linking/splink_311/splink/comparison_level_library.py) in __init__(self, col_name, valid_string_pattern, invalid_dates_as_null, valid_string_regex)
     99 
    100         col = InputColumn(col_name, sql_dialect=self._sql_dialect)
--> 101         col_name_l, col_name_r = col.name_l(), col.name_r()
    102 
    103         if invalid_dates_as_null:

[~/Documents/data_linking/splink_311/splink/input_column.py](https://file+.vscode-resource.vscode-cdn.net/Users/robinlinacre/Documents/data_linking/splink_311/~/Documents/data_linking/splink_311/splink/input_column.py) in name_l(self)
    143 
    144     def name_l(self):
--> 145         return add_suffix(self.input_name_as_tree, suffix="_l").sql(
    146             dialect=self._sql_dialect
    147         )

[~/Documents/data_linking/splink_311/splink/input_column.py](https://file+.vscode-resource.vscode-cdn.net/Users/robinlinacre/Documents/data_linking/splink_311/~/Documents/data_linking/splink_311/splink/input_column.py) in add_suffix(tree, suffix)
     24 def add_suffix(tree, suffix):
     25     tree = tree.copy()
---> 26     identifier_string = tree.find(exp.Identifier).this
     27     identifier_string = f"{identifier_string}{suffix}"
     28     tree.find(exp.Identifier).args["this"] = identifier_string

AttributeError: 'NoneType' object has no attribute 'this'

The problem is that in parse_input_name_to_sqlglot_tree

we have

  try:
      tree = sqlglot.parse_one(self.input_name, read=self._sql_dialect)
  except ParseError:
      tree = sqlglot.parse_one(f'"{self.input_name}"', read=self._sql_dialect)

  tree_signature = sqlglot_tree_signature(tree)
  valid_signatures = ["column identifier", "bracket column literal identifier"]

  if tree_signature in valid_signatures:
      return tree
  else:
      # e.g. SUR name parses to 'alias column identifier identifier'
      # but we want "SUR name"
      tree = sqlglot.parse_one(f'"{self.input_name}"', read=self._sql_dialect)
      return tree

We end up running the last part

tree = sqlglot.parse_one(f'"first_name"', read=self._sql_dialect)

And the tree is

(LITERAL this: first name, is_string: True)

when it should be

(COLUMN this: 
  (IDENTIFIER this: first_name, quoted: False))

Because it needs first_name not "first_name"

The text was updated successfully, but these errors were encountered:

hrmnjt · 2024-04-17T13:21:23Z

Hi @RobinL

I've similar issue when experimenting with a dataset

Traceback (most recent call last):
  File "/tmp/spark-d62ac452-c572-431f-9b69-4272d10087ca/linkage.py", line 569, in <module>
    exec_linkage(spark, init_loc, partition_loc, cluster_loc, matches_loc)
  File "/tmp/spark-d62ac452-c572-431f-9b69-4272d10087ca/linkage.py", line 293, in exec_linkage
    cl.levenshtein_at_thresholds("customerName", 2),
  File "/home/hadoop/environment/lib64/python3.7/site-packages/splink/comparison_library.py", line 220, in __init__
    m_probability_else,
  File "/home/hadoop/environment/lib64/python3.7/site-packages/splink/comparison_library.py", line 118, in __init__
    comparison_levels.append(self._null_level(col_name))
  File "/home/hadoop/environment/lib64/python3.7/site-packages/splink/comparison_level_library.py", line 22, in __init__
    "sql_condition": f"{col.name_l()} IS NULL OR {col.name_r()} IS NULL",
  File "/home/hadoop/environment/lib64/python3.7/site-packages/splink/input_column.py", line 145, in name_l
    return add_suffix(self.input_name_as_tree, suffix="_l").sql(
  File "/home/hadoop/environment/lib64/python3.7/site-packages/splink/input_column.py", line 26, in add_suffix
    identifier_string = tree.find(exp.Identifier).this
AttributeError: 'NoneType' object has no attribute 'this'

Unfortunately, I'm not very well versed with Splink library. Can you suggest what is going wrong here?

RobinL · 2024-04-17T14:07:55Z

Not sure, could be a version issue with sqlglot (e.g. try an earlier version). Would need to see the full code + a list of the exact packages you have installed to get any further.

RobinL mentioned this issue Nov 9, 2023

Fix InputColumn quoting for spark and improve code quality #1719

Merged

RobinL linked a pull request Nov 9, 2023 that will close this issue

Fix InputColumn quoting for spark and improve code quality #1719

Merged

RobinL closed this as completed in #1719 Nov 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] `InputColumn` does not work properly with Spark columns that need escaping #1712

[BUG] `InputColumn` does not work properly with Spark columns that need escaping #1712

RobinL commented Nov 8, 2023 •

edited

Loading

hrmnjt commented Apr 17, 2024

RobinL commented Apr 17, 2024

[BUG] InputColumn does not work properly with Spark columns that need escaping #1712

[BUG] InputColumn does not work properly with Spark columns that need escaping #1712

Comments

RobinL commented Nov 8, 2023 • edited Loading

hrmnjt commented Apr 17, 2024

RobinL commented Apr 17, 2024

[BUG] `InputColumn` does not work properly with Spark columns that need escaping #1712

[BUG] `InputColumn` does not work properly with Spark columns that need escaping #1712

RobinL commented Nov 8, 2023 •

edited

Loading