Newline in CSV quoted string breaks reader #72
Comments
@jondot The issue is that Beam doesn't natively support reading from CSV data. So we currently get around this by reading line-by-line and parsing each line as a CSV record.
@jondot,
Yup, understood. I believe that since compliant CSV can include multiline quoted fields, tfdv should support that somehow. But of course it's up to you.
Let's keep it open so that users are aware that this issue exists and don't end up creating new issues.
Thanks guys. If you know how you want it implemented in terms of design and standards, I'll be happy to put in the time to implement it. That said, I stay away from CLAs and that kind of thing, so I'll bow out if you've got CLAs.
Hi,
It looks like the current CSV reader does not support the case where a quoted string value spans several lines (that is, the value contains line breaks). This means a logical CSV row may span several physical lines, which is valid CSV.
It looks like some of this is indicated here? https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/coders/csv_decoder.py#L150
And some background here:
https://stackoverflow.com/questions/18724903/csvs-in-python-with-newline-in-quotes
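To make the problem concrete, here is a minimal sketch using only Python's standard `csv` module. It does not use tfdv or Beam; the sample data and the variable names are made up for illustration. It compares parsing each physical line on its own with letting `csv.reader` consume the whole stream:

```python
import csv
import io

# Hypothetical sample: the second record has a quoted field containing a newline.
data = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# Line-by-line parsing: every physical line is treated as a complete record,
# so the quoted field is cut in two.
per_line = [next(csv.reader([line])) for line in data.splitlines()]

# Stream parsing: csv.reader keeps reading lines until the quote is closed.
per_stream = list(csv.reader(io.StringIO(data, newline="")))

print(len(per_line))    # 4 physical lines
print(len(per_stream))  # 3 logical records
print(per_stream[1])    # ['1', 'first line\nsecond line']
```

The stream version only works because the reader sees the lines in order and can carry quote state from one line to the next.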
A question,
There's probably a reason for it, but why not use an actual csv reader?
Edit: I'm assuming it's because streaming, Beam, etc. want the unit of work to be a single line, which makes parallelism possible.
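For what it's worth, one possible direction is to stitch physical lines back into logical records before parsing. The sketch below is not how tfdv does it; `logical_records` and `parse_record` are hypothetical helpers. It assumes `"` is the quote character, that quotes inside a field are escaped by doubling (as in RFC 4180), and that lines arrive in order without their trailing newline:

```python
import csv
import io


def logical_records(lines):
    """Yield logical CSV records from ordered physical lines (hypothetical helper).

    A record is considered complete once the number of double quotes seen
    so far is even; doubled quotes ("") add two and keep the parity intact.
    """
    buffer = []
    quote_count = 0
    for line in lines:
        buffer.append(line)
        quote_count += line.count('"')
        if quote_count % 2 == 0:
            yield "\n".join(buffer)
            buffer = []
            quote_count = 0
    if buffer:
        # Unterminated quote at end of input: emit what we have.
        yield "\n".join(buffer)


def parse_record(record):
    """Parse one logical record into a list of fields (empty list for a blank record)."""
    return next(csv.reader(io.StringIO(record, newline="")), [])


physical = ['id,comment', '1,"first line', 'second line"', '2,"say ""hi"""']
records = list(logical_records(physical))
rows = [parse_record(r) for r in records]
```

Note the trade-off: this needs a sequential, ordered pass over the lines of each file, which is exactly what gets lost when lines are distributed independently for parallelism.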