Newline in CSV quoted string breaks reader #72

Open
jondot opened this issue Jul 8, 2019 · 5 comments

Comments

@jondot

jondot commented Jul 8, 2019

Hi,
It looks like the current CSV reader does not support quoted string values that span several lines (that is, values containing line breaks). A logical CSV row may therefore span several physical lines, which is valid CSV.
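
The failure mode described above can be reproduced with Python's standard `csv` module alone. This is a minimal sketch, not TFDV's decoder: the sample data is made up, and the per-line loop only imitates a reader that treats every physical line as one record.

```python
import csv
import io

# Hypothetical sample: the second field of the data row contains a
# line break inside a quoted string.
data = 'id,comment\n1,"first line\nsecond line"\n'

# Parsing the whole stream: the csv module keeps the quoted newline
# inside a single field, so there is one header row and one data row.
logical_rows = list(csv.reader(io.StringIO(data, newline="")))

# Parsing each physical line on its own (an imitation of a
# line-oriented reader): the quoted field is cut in two.
physical_rows = [next(csv.reader([line])) for line in data.splitlines()]

print(len(logical_rows), "logical rows")
print(len(physical_rows), "rows when split by physical line")
```

The stream-based parse returns two rows, while the per-line parse returns three fragments, none of which contains the complete quoted value.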

It looks like some of this is indicated here: https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/coders/csv_decoder.py#L150

And some background here:

https://stackoverflow.com/questions/18724903/csvs-in-python-with-newline-in-quotes

A question:
There's probably a reason for it, but why not use an actual CSV reader?
Edit: I'm assuming it's because streaming, Beam, etc. want one unit to be one line, which makes parallelism possible.

@jondot changed the title from "Newline in quoted string breaks reader" to "Newline in CSV quoted string breaks reader" on Jul 8, 2019
@gowthamkpr self-assigned this on Jul 8, 2019
@paulgc
Member

paulgc commented Jul 8, 2019

@jondot The issue is that Beam doesn't natively support reading from CSV data. So we currently get around this by reading line-by-line and parsing each line as a CSV record.

@aaltay @chamikaramj @katsiapis
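
One conceivable workaround for the line-by-line approach described above is to rejoin physical lines into logical records before parsing. The sketch below is an illustration only, not TFDV or Beam code. `merge_physical_lines` is a hypothetical helper, and it assumes that quotes appear only as field delimiters or as doubled escapes (`""`), and that lines arrive in their original order, which a line-oriented parallel reader may not guarantee.

```python
import csv
from typing import Iterable, Iterator, List


def merge_physical_lines(lines: Iterable[str], quotechar: str = '"') -> Iterator[str]:
    """Join physical lines until every quoted field is closed (hypothetical helper)."""
    buffer: List[str] = []
    open_quote = False
    for line in lines:
        # An odd number of quote characters toggles the "inside a quoted
        # field" state; doubled quotes come in pairs and keep the parity.
        if line.count(quotechar) % 2 == 1:
            open_quote = not open_quote
        buffer.append(line)
        if not open_quote:
            yield "\n".join(buffer)
            buffer = []
    if buffer:
        # Unterminated quote at end of input: emit the remainder
        # instead of silently dropping it.
        yield "\n".join(buffer)


# Made-up input, already split into physical lines without terminators.
physical = ['id,comment', '1,"first line', 'second line"', '2,"say ""hi"""']
records = [next(csv.reader([rec])) for rec in merge_physical_lines(physical)]
print(records)
```

The ordering assumption is the weak point: if a file is split between workers in the middle of a quoted field, neither side can tell from its own lines whether it starts inside or outside a quote.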

@rmothukuru

@jondot,
Since the issue you raised depends on functionality that Beam currently doesn't support, please confirm whether we can close this issue, or whether you would like it implemented as a feature once Beam supports it.
Thanks.

@jondot
Author

jondot commented Jul 16, 2019

Yup, understood. I believe that since compliant CSV can include multiline quoted fields, TFDV should somehow support that. But of course it's up to you.

@paulgc
Member

paulgc commented Jul 16, 2019

Let's keep it open so that users are aware that this issue exists and don't end up creating new issues.

@jondot
Author

jondot commented Jul 16, 2019

Thanks, guys. If you know how you want it implemented in terms of design and standards, I'll be happy to put in the time to implement it. That said, I stay away from CLAs and that kind of thing, so I'll bow out if you require a CLA.

4 participants