Consider phrase breaks? #24

Open
jlevy opened this issue Mar 17, 2019 · 7 comments

Labels: enhancement, thinking

Comments

@jlevy
Owner

jlevy commented Mar 17, 2019

Some seem to want line breaks at phrase boundaries, like commas or clauses. This might or might not be a good idea, so listing it here to track+discuss.

@jlevy added the enhancement label Mar 17, 2019
@jlevy
Owner Author

jlevy commented Mar 17, 2019

(From @ivanistheone in #13:)

I was excited to try this package, but it didn't wrap things according to semantics -- e.g., on commas or other logical clauses.

Expected:

Data is what makes machine learning work.
Sure clever math solutions and optimized algorithms play an important role too,
but it's really the data that is the differentiating factor.
Specifically,
we're talking about a source of plentiful, high quality, structured, clean,
and well labelled examples of the machine learning task to be performed.

The features we use to represent each instance of a machine learning task are of central importance for the overall success of a machine learning system.
Indeed,
machine learning practitioners in the industry often describe most of the performance gains they observe come from using better features,
rather then using fancy machine learning models.
Luckily there the field of \emph{feature engineering} exists,
which consists of an arsenal of best practices and tricks for associating the most useful feature vectors as possible for each instance of the dataset.

Observed after reformat + wrap:

Data is what makes machine learning work.
Sure clever math solutions and optimized algorithms play an important role too, but it's
really the data that is the differentiating factor.
Specifically, we're talking about a source of plentiful, high quality, structured, clean,
and well labelled examples of the machine learning task to be performed.

The features we use to represent each instance of a machine learning task are of central
importance for the overall success of a machine learning system.
Indeed, machine learning practitioners in the industry often describe most of the
performance gains they observe come from using better features, rather then using fancy
machine learning models.
Luckily there the field of \emph{feature engineering} exists, which consists of an
arsenal of best practices and tricks for associating the most useful feature vectors as
possible for each instance of the dataset.

Specifically, I'd expect the "but it's" to be on the next line.

@jlevy
Owner Author

jlevy commented Mar 17, 2019

Current intended behavior is
(1) to break on sentences (unless they are so short they might not be sentences at all, in which case we err on the side of not breaking)
(2) to emphasize simplicity and language neutrality (e.g. not to use any overly complex NLP or crazy rules that would make this not work or be unpredictably nondeterministic as the package evolves)
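
For concreteness, a minimal sketch of rule (1) in TypeScript (hypothetical illustration only, not Flowmark's actual implementation; the threshold value is made up):

// Break on sentence boundaries, but merge fragments so short they may not
// be real sentences (e.g. "Mr.", "e.g.") -- erring on the side of not breaking.
const MIN_SENTENCE_LEN = 15; // made-up guard value

function breakOnSentences(paragraph: string): string[] {
  // Deliberately simple and language-neutral: split after ., !, or ?
  // followed by whitespace.
  const parts = paragraph.split(/(?<=[.!?])\s+/);
  const lines: string[] = [];
  for (const part of parts) {
    const last = lines[lines.length - 1];
    if (last !== undefined && last.length < MIN_SENTENCE_LEN) {
      // Previous fragment is suspiciously short: don't break; merge it.
      lines[lines.length - 1] = last + " " + part;
    } else {
      lines.push(part);
    }
  }
  return lines;
}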

It's possible though that breaking on longer phrases is a good idea, but we'd need simple rules. It might also be harder to explain and get used to.

@ivanistheone did you have any thoughts or use cases on why you'd prefer phrase breaks to sentence breaks?

This could also be a flag, but that comes at a cost too.

@ivanistheone

ivanistheone commented Mar 17, 2019

did you have any thoughts or use cases on why you'd prefer phrase breaks to sentence breaks?

The high-level reason is that phrases are the smallest coherent unit, so it makes sense to see them each on their own line. Similar to how one paragraph contains one idea, each phrase is one coherent building block used to construct that idea.

There are also several practical, low-level reasons for the one-phrase-per-line approach:

  • In my experience working on books, I do a lot of editing and moving around, which is made much easier when I can do "surgery" on the text with only vertical selection commands (always cutting entire chunks)
  • long phrases stick out visually in the source and serve as red flags for parts that need to be simplified --- e.g. if you write a run-on phrase that is 200+ characters, it will be clearly visible, and the source-code ugliness might prompt you to shorten or simplify it
  • similarly, the use of introductory words + comma like Indeed, However, etc. becomes apparent (ragged, ugly source code), which forces me to use them sparingly
  • one phrase per line makes GitHub diffs look nice --- although thanks to diff --color-words and latexdiff this is not so important when working on the command line
  • more thoughts on that here: https://rhodesmill.org/brandon/2012/one-sentence-per-line/ (although I think sub-phrase line breaks might be going too far)

@jlevy
Owner Author

jlevy commented Apr 1, 2019

Thanks! Yes, I'm familiar with most of these goals (some more discussion here if you're interested).

Your first benefit is interesting, for sure, and perhaps works better with phrase breaks than sentence breaks. The 2nd and 3rd are interesting as well, but I'm not sure every editor would share this perception, so I'd hesitate to make it the default. Note the 4th and 5th are mostly already benefits of sentence-per-line-with-wrap-on-overflow, the current behavior.

At Holloway, we use Flowmark on large documents with several committers pretty effectively, and I've found it's a good, realistic compromise so far, balancing semantic editing and stability with keeping the text sane-looking. But I'll leave this open; perhaps this could be a setting in the future, and I'm glad to hear if anyone else asks for it!

@jlevy added the thinking label Apr 1, 2019
@asford

asford commented Feb 15, 2021

This form of semantic break would be useful when formatting markdown documentation.
Adjusting the split rules to split-on-semantic-breaks-if-required could work well.

For example, a heuristic that splits on semantic breaks once a line passes ~60% of the
maximum text width, but falls back to a non-semantic break if required to stay below the
max width? (Rough sketch after the example below.)

E.g.:

Data is what makes machine learning work.
Sure clever math solutions and optimized algorithms play an important role too,
but it's really the data that is the differentiating factor.
Specifically, we're talking about a source of plentiful, high quality, structured, clean,
and well labelled examples of the machine learning task to be performed.

The features we use to represent each instance of a machine learning task are of central
importance for the overall success of a machine learning system.
Indeed, machine learning practitioners in the industry often describe most of the
performance gains they observe come from using better features,
rather then using fancy machine learning models.

Luckily there the field of \emph{feature engineering} exists,
which consists of an arsenal of best practices and tricks for associating the most 
useful feature vectors as possible for each instance of the dataset.
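
A rough TypeScript sketch of that heuristic (hypothetical; the width values and the clause-mark test are my own assumptions, not anything Flowmark currently implements):

// Soft/hard split heuristic: prefer a semantic break once a line passes
// ~60% of max width; fall back to a plain width break only when forced.
const MAX_WIDTH = 90;                   // assumed hard wrap limit
const SOFT_THRESHOLD = 0.6 * MAX_WIDTH; // the ~60% soft threshold above

function wrapWithSoftBreaks(words: string[]): string[] {
  const lines: string[] = [];
  let current = "";
  for (const word of words) {
    const candidate = current ? current + " " + word : word;
    if (current && candidate.length > MAX_WIDTH) {
      // Hard limit: break before the word that would overflow,
      // even without a semantic boundary.
      lines.push(current);
      current = word;
    } else {
      current = candidate;
      if (current.length >= SOFT_THRESHOLD && /[,;:]$/.test(word)) {
        // Past the soft threshold at a clause mark: take the semantic break.
        lines.push(current);
        current = "";
      }
    }
  }
  if (current) lines.push(current);
  return lines;
}

Run over the paragraphs above, something like this would keep short clauses together but break at the trailing commas once a line is past the soft threshold.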

@asford

asford commented Feb 16, 2021

As a quick follow-up: the compromise .clauses selector may be a good ~80%-effect,
~10%-effort solution for this style of splitting.

I've played around with it on this test data, https://observablehq.com/d/59be2e7af575c8ad, and it definitely has warts, but it could be a good start.
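
For anyone who wants to experiment, a minimal sketch of the splitting step using compromise's .clauses() (only the .clauses() call comes from compromise; the glue around it is hypothetical, not code from the notebook):

import nlp from 'compromise'

// Let compromise find clause boundaries, then emit one clause per line.
function breakOnClauses(paragraph: string): string[] {
  return nlp(paragraph).clauses().out('array');
}

// Roughly (actual output may differ):
// breakOnClauses("Sure clever math helps, but it's really the data.")
//   -> ["Sure clever math helps,", "but it's really the data."]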
