Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] InputColumn does not work properly with Spark columns that need escaping #1712

Closed
RobinL opened this issue Nov 8, 2023 · 2 comments · Fixed by #1719
Closed

[BUG] InputColumn does not work properly with Spark columns that need escaping #1712

RobinL opened this issue Nov 8, 2023 · 2 comments · Fixed by #1719

Comments

@RobinL
Copy link
Member

RobinL commented Nov 8, 2023

Reprex:

This works:

from splink.input_column import InputColumn
InputColumn("first_name", sql_dialect="spark").name_l()

This does not (note there's a space in first name in the below, but an underscore above:

from splink.input_column import InputColumn
InputColumn("first name", sql_dialect="spark").l_name_as_l()
Click to expand full traceback
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
[~/Documents/data_linking/splink_311/try_spark.py](https://file+.vscode-resource.vscode-cdn.net/Users/robinlinacre/Documents/data_linking/splink_311/~/Documents/data_linking/splink_311/try_spark.py) in <cell line: 0>()
----> [16](file:///Users/robinlinacre/Documents/data_linking/splink_311/try_spark.py?line=15) InputColumn("first name", sql_dialect="spark").name_l()

[~/Documents/data_linking/splink_311/splink/input_column.py](https://file+.vscode-resource.vscode-cdn.net/Users/robinlinacre/Documents/data_linking/splink_311/~/Documents/data_linking/splink_311/splink/input_column.py) in name_l(self)
    143 
    144     def name_l(self):
--> 145         return add_suffix(self.input_name_as_tree, suffix="_l").sql(
    146             dialect=self._sql_dialect
    147         )

[~/Documents/data_linking/splink_311/splink/input_column.py](https://file+.vscode-resource.vscode-cdn.net/Users/robinlinacre/Documents/data_linking/splink_311/~/Documents/data_linking/splink_311/splink/input_column.py) in add_suffix(tree, suffix)
     24 def add_suffix(tree, suffix):
     25     tree = tree.copy()
---> 26     identifier_string = tree.find(exp.Identifier).this
     27     identifier_string = f"{identifier_string}{suffix}"
     28     tree.find(exp.Identifier).args["this"] = identifier_string

AttributeError: 'NoneType' object has no attribute 'this'

And hence a full example script with a first name column doesn't work:

click to expand script

from pyspark.context import SparkConf, SparkContext
from pyspark.sql import SparkSession

from splink.datasets import splink_datasets
from splink.spark.comparison_library import exact_match
from splink.spark.linker import SparkLinker

conf = SparkConf()

conf.set("spark.driver.memory", "12g")
conf.set("spark.sql.shuffle.partitions", "12")

sc = SparkContext.getOrCreate(conf=conf)
sc.setCheckpointDir("tmp_checkpoints/")
spark = SparkSession(sc)
df_pandas = splink_datasets.fake_1000
df = spark.createDataFrame(df_pandas)

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.surname = r.surname",
    ],
    "comparisons": [
        exact_match("first name"),
        exact_match("city", term_frequency_adjustments=True),
    ],
}


linker = SparkLinker(df, settings)
linker.estimate_u_using_random_sampling(target_rows=1e2)

full traceback
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
[~/Documents/data_linking/splink_311/try_spark.py](https://file+.vscode-resource.vscode-cdn.net/Users/robinlinacre/Documents/data_linking/splink_311/~/Documents/data_linking/splink_311/try_spark.py) in <cell line: 0>()
     [19](file:///Users/robinlinacre/Documents/data_linking/splink_311/try_spark.py?line=18)     ],
     [20](file:///Users/robinlinacre/Documents/data_linking/splink_311/try_spark.py?line=19)     "comparisons": [
---> [21](file:///Users/robinlinacre/Documents/data_linking/splink_311/try_spark.py?line=20)         exact_match("first name"),
     [22](file:///Users/robinlinacre/Documents/data_linking/splink_311/try_spark.py?line=21)         exact_match("city", term_frequency_adjustments=True),
     [23](file:///Users/robinlinacre/Documents/data_linking/splink_311/try_spark.py?line=22)     ],

[~/Documents/data_linking/splink_311/splink/comparison_library.py](https://file+.vscode-resource.vscode-cdn.net/Users/robinlinacre/Documents/data_linking/splink_311/~/Documents/data_linking/splink_311/splink/comparison_library.py) in __init__(self, col_name, regex_extract, valid_string_pattern, set_to_lowercase, term_frequency_adjustments, m_probability_exact_match, m_probability_else, include_colname_in_charts_label)
    101             "comparison_description": "Exact match vs. anything else",
    102             "comparison_levels": [
--> 103                 self._null_level(col_name, valid_string_pattern),
    104                 self._exact_match_level(
    105                     col_name,

[~/Documents/data_linking/splink_311/splink/comparison_level_library.py](https://file+.vscode-resource.vscode-cdn.net/Users/robinlinacre/Documents/data_linking/splink_311/~/Documents/data_linking/splink_311/splink/comparison_level_library.py) in __init__(self, col_name, valid_string_pattern, invalid_dates_as_null, valid_string_regex)
     99 
    100         col = InputColumn(col_name, sql_dialect=self._sql_dialect)
--> 101         col_name_l, col_name_r = col.name_l(), col.name_r()
    102 
    103         if invalid_dates_as_null:

[~/Documents/data_linking/splink_311/splink/input_column.py](https://file+.vscode-resource.vscode-cdn.net/Users/robinlinacre/Documents/data_linking/splink_311/~/Documents/data_linking/splink_311/splink/input_column.py) in name_l(self)
    143 
    144     def name_l(self):
--> 145         return add_suffix(self.input_name_as_tree, suffix="_l").sql(
    146             dialect=self._sql_dialect
    147         )

[~/Documents/data_linking/splink_311/splink/input_column.py](https://file+.vscode-resource.vscode-cdn.net/Users/robinlinacre/Documents/data_linking/splink_311/~/Documents/data_linking/splink_311/splink/input_column.py) in add_suffix(tree, suffix)
     24 def add_suffix(tree, suffix):
     25     tree = tree.copy()
---> 26     identifier_string = tree.find(exp.Identifier).this
     27     identifier_string = f"{identifier_string}{suffix}"
     28     tree.find(exp.Identifier).args["this"] = identifier_string

AttributeError: 'NoneType' object has no attribute 'this'

The problem is that in parse_input_name_to_sqlglot_tree

we have

  try:
      tree = sqlglot.parse_one(self.input_name, read=self._sql_dialect)
  except ParseError:
      tree = sqlglot.parse_one(f'"{self.input_name}"', read=self._sql_dialect)

  tree_signature = sqlglot_tree_signature(tree)
  valid_signatures = ["column identifier", "bracket column literal identifier"]

  if tree_signature in valid_signatures:
      return tree
  else:
      # e.g. SUR name parses to 'alias column identifier identifier'
      # but we want "SUR name"
      tree = sqlglot.parse_one(f'"{self.input_name}"', read=self._sql_dialect)
      return tree

We end up running the last part

tree = sqlglot.parse_one(f'"first_name"', read=self._sql_dialect)

And the tree is

(LITERAL this: first name, is_string: True)

when it should be

(COLUMN this: 
  (IDENTIFIER this: first_name, quoted: False))

Because it needs first_name not "first_name"

@hrmnjt
Copy link

hrmnjt commented Apr 17, 2024

Hi @RobinL

I've similar issue when experimenting with a dataset

Traceback (most recent call last):
  File "/tmp/spark-d62ac452-c572-431f-9b69-4272d10087ca/linkage.py", line 569, in <module>
    exec_linkage(spark, init_loc, partition_loc, cluster_loc, matches_loc)
  File "/tmp/spark-d62ac452-c572-431f-9b69-4272d10087ca/linkage.py", line 293, in exec_linkage
    cl.levenshtein_at_thresholds("customerName", 2),
  File "/home/hadoop/environment/lib64/python3.7/site-packages/splink/comparison_library.py", line 220, in __init__
    m_probability_else,
  File "/home/hadoop/environment/lib64/python3.7/site-packages/splink/comparison_library.py", line 118, in __init__
    comparison_levels.append(self._null_level(col_name))
  File "/home/hadoop/environment/lib64/python3.7/site-packages/splink/comparison_level_library.py", line 22, in __init__
    "sql_condition": f"{col.name_l()} IS NULL OR {col.name_r()} IS NULL",
  File "/home/hadoop/environment/lib64/python3.7/site-packages/splink/input_column.py", line 145, in name_l
    return add_suffix(self.input_name_as_tree, suffix="_l").sql(
  File "/home/hadoop/environment/lib64/python3.7/site-packages/splink/input_column.py", line 26, in add_suffix
    identifier_string = tree.find(exp.Identifier).this
AttributeError: 'NoneType' object has no attribute 'this'

Unfortunately, I'm not very well versed with Splink library. Can you suggest what is going wrong here?

@RobinL
Copy link
Member Author

RobinL commented Apr 17, 2024

Not sure, could be a version issue with sqlglot (e.g. try an earlier version). Would need to see the full code + a list of the exact packages you have installed to get any further.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants