Skip to content

Delimited Identifiers

Martin Traverso edited this page Jul 22, 2019 · 6 revisions

The intent of this document is to capture the high-level thoughts and ideas about how to add support for case sensitive identifiers.

Specification

<delimited identifier> ::=
  <double quote> <delimited identifier body> <double quote>

<delimited identifier body> ::=  <delimited identifier part>...
<delimited identifier part> ::=
    <nondoublequote character>
  | <doublequote symbol>

<Unicode delimited identifier> ::=
  U <ampersand> <double quote> <Unicode delimiter body> <double quote>
      <Unicode escape specifier>
<Unicode escape specifier> ::=
  [ UESCAPE <quote> <Unicode escape character> <quote> ]
<Unicode delimiter body> ::=
  <Unicode identifier part>...
<Unicode identifier part> ::=
    <delimited identifier part>
  | <Unicode escape value>
24) For every <identifier body> IB there is exactly one corresponding case-normal form CNF. CNF is an <identifier body> derived from IB as follows:
Let n be the number of characters in IB. For i ranging from 1 (one) to n, the i-th character Mi of IB is transliterated into the corresponding character 
or characters of CNF as follows:
Case:
   a) If Mi is a lower case character or a title case character for which an equivalent upper case sequence U is de ned by Unicode, then let j be th
       e number of characters in U; the next j characters of CNF are U.
   b) Otherwise, the next character of CNF is Mi.
25) The case-normal form of the <identifier body> of a <regular identifier> is used for purposes such as and including determination of identifier 
      equivalence, representation in the Definition and Information Schemas, and representation in diagnostics areas.

...

27) Two <regular identifier>s are equivalent if the case-normal forms of their <identifier body>s, considered as the repetition of a <character string literal> 
that specifies a <character set specification> of SQL_IDENTIFIER and an implementation-defined collation IDC that is sensitive to case, compare equally 
according to the comparison rules in Subclause 8.2, “<comparison predicate>”.

28) A <regular identifier> and a <delimited identifier> are equivalent if the case-normal form of the <identifier body> of the <regular identifier> and the 
<delimited identifier body> of the <delimited identifier> (with all occurrences of <quote> replaced by <quote symbol> and all occurrences of 
<doublequote symbol> replaced by <double quote>), considered as the repetition of a <character string literal> that specifies a <character set specification>
 of SQL_IDENTIFIER and IDC, compare equally according to the comparison rules in Subclause 8.2, “<comparison predicate>”.


29) Two<delimited identifier>s are equivalent if their <delimited identifierbody>s,considered as the repetition of a <character string literal> that specifies
 a <character set specification> of SQL_IDENTIFIER and an implementation-defined collation that is sensitive to case, compare equally according to the
 comparison rules in Subclause 8.2, “<comparison predicate>”.

30) Two <Unicode delimited identifier>s are equivalent if their <Unicode delimiter body>s, considered as the repetition of a <character string literal> that
 specifies a <character set specification> of SQL_IDENTIFIER and an implementation-defined collation that is sensitive to case, compare equally according
 to the comparison rules in Subclause 8.2, “<comparison predicate>”.

31) A <Unicode delimited identifier> and a <delimited identifier> are equivalent if their <Unicode delimiter body> and <delimited identifier body>, 
respectively, each considered as the repetition of a <character string literal> that specifies a <character set specification> of SQL_IDENTIFIER and 
an implementation-defined collation that is sensitive to case, compare equally according to the comparison rules in Subclause 8.2, “<comparison predicate>”.

32) A <regular identifier> and a <Unicode delimited identifier> are equivalent if the case-normal form of the <identifier body> of the <regular identifier> 
and the <Unicode delimiter body> of the <Unicode delimited identifier> considered as the repetition of a <character string literal>, each specifying a
 <character set specification> of SQL_IDENTIFIER and an implementation-defined collation that is sensitive to case, compare equally according to the 
comparison rules in Subclause 8.2, “<comparison predicate>”.

In a nutshell, when comparing identifiers, if any of the identifiers is non-delimited it needs to be canonicalized (upper case) and then compared using regular string comparison operations.

The book "SQL: 1999: Understanding Relational Language Components" by Jim Melton, one of the editors of the SQL spec says:

In effect, SQL:1999 changes all lowercase letters in regular identifiers to their uppercase-equivalent letters. This is especially important for identifiers that are stored as data values in the views of the Information Schema. [...] When a delimited identifier is stored into the views of the Information Schema, the double quotes are not stored, but all the other characters are, just as they appeared in the delimited identifier. Therefore, the regular identifier TITLES and the delimited identifier "TITLES" are stored identically and are therefore completely equivalent.

Goals

  • This should allow us to capture the case sensitive identifiers (schemaName, tableName, functions, type etc) to the connector.
  • For connector which doesn't comes under this specification (like Hive) we should do a case insensitive matching.(including legacy connectors).

Considerations

Connectors can be classified in one of these categories:

  • SQL compliant (e.g., Oracle, DB2)
    • Delimited identifiers are significant
    • Normalizes to upper case
  • Non-compliant
    • Case-sensitive (E.g., PostgreSQL)
      • Delimited identifiers are significant
      • Normalizes to something other than upper-case
    • Case-insensitive (e.g., Hive)
      • Delimited identifiers behave as un-delimited
      • May normalize identifiers

Presto interacts with identifiers in a few different ways:

  • Matching for the purpose of looking up entities (catalogs, schemas, tables, views, users, roles, session properties)
    • Columns are a special case, since they are resolved by the engine based on table metadata provided by connectors
  • Providing identifiers to connectors when creating entities
  • Presenting and processing entity names returned by connectors, such as in column resolution or any of the metadata introspection commands (SHOW, DESCRIBE, etc).

Approaches

Approach 1

These specifications can be applied at the SQL language level so after the query has been parsed we can capture the normalized value (as per the spec) in a plain String and pass it to the connector. This approach requires to remove the check and transformation of the input String to lower case irrespective whether it is delimited or not.

Approach 2

Similar to an Identifier in AST we can create an object Name which maintains the raw string and a boolean to represent whether it is delimited or not and we pass this object to the connector. This approach involves changing all the APIs to move from String to this new object (for Metadata, AccessControl APIs)

Column Matching

Table and Schemas are matched by the connector whereas the column matching is performed by the query engine. So now matching of columns should be performed as per the spec instead of the legacy way. We need to add a session property to maintain the legacy way of matching the column.