Merge pull request #1026 from IBM/perf-readability
Readability transform: performance improvement and adding score_list argument
touma-I authored Feb 11, 2025
2 parents e3ce06e + 5c19c86 commit d991412
Showing 14 changed files with 942 additions and 6,394 deletions.
43 changes: 25 additions & 18 deletions transforms/language/readability/README.md
@@ -72,13 +72,19 @@ or English, focusing on the number of miniwords and length of sentences.
The set of dictionary keys holding [ReadabilityTransform](dpk_readability/runtime.py) configuration for values are as follows:

* _readability_contents_column_name_ - specifies the name of the column holding the document text. The default is `text`.
* _readability_curriculum_ - set to `True` when the data is being prepared for curriculum learning; the documents are then annotated with the `flesch_kincaid`, `gunning_fog`, and `automated_readability_index` readability scores, plus the average of these three grade-level scores, to speed up the annotation process.
* _readability_score_list_ - list of readability scores to be computed by the transform;
valid values: `coleman_liau_index_textstat`, `flesch_kincaid_textstat`,
`difficult_words_textstat`, `spache_readability_textstat`, `smog_index_textstat`,
`reading_time_textstat`, `dale_chall_readability_score_textstat`, `text_standard_textstat`,
`automated_readability_index_textstat`, `gunning_fog_textstat`, `flesch_ease_textstat`,
`mcalpine_eflaw_textstat`, `linsear_write_formula_textstat`.
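
The `*_textstat` suffix in these score names suggests they are computed with the [textstat](https://pypi.org/project/textstat/) package. As a minimal illustration (not part of the transform, and the exact mapping of score names to textstat functions is an assumption based on the names), a few of these metrics can be computed directly on a text string:

<pre>
import textstat  # assumes the textstat package is installed: pip install textstat

sample = "The quick brown fox jumps over the lazy dog. It was a bright cold day in April."

# Each name below is the assumed textstat counterpart of a transform score.
print("flesch_ease:", textstat.flesch_reading_ease(sample))
print("flesch_kincaid:", textstat.flesch_kincaid_grade(sample))
print("gunning_fog:", textstat.gunning_fog(sample))
print("reading_time:", textstat.reading_time(sample))
print("text_standard:", textstat.text_standard(sample))
</pre>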


Additionally, a set of data access-specific arguments is provided that allows
the location of domain list files to be specified, so that these
files can be stored in the local file system or in S3 storage, for example.
The arguments are as follows (and generally match the TransformLauncher's
data access arguments but with the `extreme_tokenized_` prefix).
data access arguments but with the `readability_` prefix).

* _readability_local_config_ - specifies the input and output folders.
* _readability_s3_config_ - specifies the input and output paths in s3.
@@ -94,20 +100,20 @@ annotated `readability-test.parquet` file and the `metadata.json` file.
<pre>
cma:readability$ make venv PYTHON=python3.11
cma:readability$ source venv/bin/activate
(venv) cma:readability$ python -m dpk_readability.runtime --data_local_config "{ 'input_folder': 'test-data/input', 'output_folder': 'output' }"
12:07:23 INFO - Launching Readability transform
12:07:23 INFO - Readability parameters are : {'readability_contents_column_name': 'contents', 'readability_curriculum': False}
12:07:23 INFO - pipeline id pipeline_id
12:07:23 INFO - code location None
12:07:23 INFO - data factory data_ is using local data access: input_folder - test-data/input output_folder - output
12:07:23 INFO - data factory data_ max_files -1, n_sample -1
12:07:23 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
12:07:23 INFO - orchestrator readability started at 2025-01-28 12:07:23
12:07:23 INFO - Number of files is 1, source profile {'max_file_size': 0.014194488525390625, 'min_file_size': 0.014194488525390625, 'total_file_size': 0.014194488525390625}
12:07:23 INFO - Completed 1 files (100.0%) in 0.002 min
12:07:23 INFO - Done processing 1 files, waiting for flush() completion.
12:07:23 INFO - done flushing in 0.0 sec
12:07:23 INFO - Completed execution in 0.003 min, execution result 0
(venv) cma:readability$ python -m dpk_readability.runtime --data_local_config "{ 'input_folder': 'test-data/input', 'output_folder': 'output' }" --readability_score_list "['reading_time_textstat','spache_readability_textstat','text_standard_textstat']"
13:07:23 INFO - Launching Readability transform
13:07:23 INFO - Readability parameters are : {'readability_contents_column_name': 'contents', 'readability_score_list': ['reading_time_textstat', 'spache_readability_textstat', 'text_standard_textstat']}
13:07:23 INFO - pipeline id pipeline_id
13:07:23 INFO - code location None
13:07:23 INFO - data factory data_ is using local data access: input_folder - test-data/input output_folder - output
13:07:23 INFO - data factory data_ max_files -1, n_sample -1
13:07:23 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
13:07:23 INFO - orchestrator readability started at 2025-02-07 13:07:23
13:07:23 INFO - Number of files is 1, source profile {'max_file_size': 0.014194488525390625, 'min_file_size': 0.014194488525390625, 'total_file_size': 0.014194488525390625}
13:07:24 INFO - Completed 1 files (100.0%) in 0.002 min
13:07:24 INFO - Done processing 1 files, waiting for flush() completion.
13:07:24 INFO - done flushing in 0.0 sec
13:07:24 INFO - Completed execution in 0.002 min, execution result 0
(venv) cma:readability$ deactivate
</pre>
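
The same run can also be scripted from Python by shelling out to the CLI shown above; this is just a convenience sketch (the paths and score list are copied from the example and assume the virtual environment with `dpk_readability` installed is active):

<pre>
import subprocess
import sys

# Re-run the second CLI example above from Python.
cmd = [
    sys.executable, "-m", "dpk_readability.runtime",
    "--data_local_config", "{ 'input_folder': 'test-data/input', 'output_folder': 'output' }",
    "--readability_score_list", "['reading_time_textstat','spache_readability_textstat','text_standard_textstat']",
]
subprocess.run(cmd, check=True)
</pre>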

@@ -134,8 +140,8 @@ options:
-h, --help show this help message and exit
--readability_contents_column_name READABILITY_CONTENTS_COLUMN_NAME
contents column name for input parquet table to transform
--readability_curriculum READABILITY_CURRICULUM
curriculum parameter for transform; select True for curriculum learning
--readability_score_list READABILITY_SCORE_LIST
list of readability scores to be computed by the transform; valid values: {'flesch_ease_textstat', 'reading_time_textstat', 'flesch_kincaid_textstat', 'automated_readability_index_textstat', 'linsear_write_formula_textstat', 'text_standard_textstat', 'smog_index_textstat', 'difficult_words_textstat', 'spache_readability_textstat', 'dale_chall_readability_score_textstat', 'mcalpine_eflaw_textstat', 'gunning_fog_textstat', 'coleman_liau_index_textstat'}
--data_s3_cred DATA_S3_CRED
AST string of options for s3 credentials. Only required for S3 data access.
access_key: access key help text
@@ -181,3 +187,4 @@ options:
path: Path within the repository
Example: { 'github': 'https://github.com/somerepo', 'commit_hash': '1324',
'path': 'transforms/universal/code' }

14 changes: 6 additions & 8 deletions transforms/language/readability/dpk_readability/common.py
@@ -63,25 +63,23 @@
"""Key holds the mcalpine_eflaw_textstat R score threshold parameter"""
reading_time_textstat = "reading_time_textstat"
"""Key holds the reading_time_textstat R score threshold parameter"""
avg_grade_level = "avg_grade_level"
"""Key holds the avg_grade_level R score threshold parameter"""
contents_column_name = "contents_column_name"
"""Contents column name for the input parquet table to the transform"""
curriculum = "curriculum"
"""curriculum parameter for transform; either True or False"""
score_list = "score_list"
"""list of readability scores to be computed by the transform"""


########################################################################################
# CLI parameters corresponding to each config key
"""avg_grade_level R score threshold parameter"""
contents_column_name_cli_param = f"{cli_prefix}{contents_column_name}"
"""Content column name for parquet input table to transform"""
curriculum_cli_param = f"{cli_prefix}{curriculum}"
"""curriculum parameter for transform; either True or False"""
score_list_cli_param = f"{cli_prefix}{score_list}"
"""list of readability scores or a single readability scores to be computed by the transform"""


# The set of default values that can be overwritten from the CLI
contents_column_name_default = "contents"
"""The default value for contents_column_name"""
curriculum_default = False
"""curriculum parameter for transform; either True or False"""
score_list_default = mcalpine_eflaw_textstat
"""readability score that is computed by default"""
55 changes: 49 additions & 6 deletions transforms/language/readability/dpk_readability/runtime.py
@@ -10,6 +10,8 @@
# limitations under the License.
################################################################################

import argparse
import ast
import sys
from argparse import ArgumentParser, Namespace

@@ -21,12 +23,25 @@
from data_processing.transform import TransformConfiguration
from data_processing.utils import CLIArgumentProvider, ParamsUtils, get_logger, str2bool
from dpk_readability.common import (
automated_readability_index_textstat,
cli_prefix,
coleman_liau_index_textstat,
contents_column_name_cli_param,
contents_column_name_default,
curriculum_cli_param,
curriculum_default,
dale_chall_readability_score_textstat,
difficult_words_textstat,
flesch_ease_textstat,
flesch_kincaid_textstat,
gunning_fog_textstat,
linsear_write_formula_textstat,
mcalpine_eflaw_textstat,
reading_time_textstat,
score_list_cli_param,
score_list_default,
short_name,
smog_index_textstat,
spache_readability_textstat,
text_standard_textstat,
)
from dpk_readability.transform import ReadabilityTransform

@@ -54,19 +69,47 @@ def add_input_params(self, parser: ArgumentParser) -> None:
By convention a common prefix should be used for all transform-specific CLI args
(e.g., noop_, pii_, etc.)
"""
valid_values = {
flesch_ease_textstat,
flesch_kincaid_textstat,
gunning_fog_textstat,
smog_index_textstat,
coleman_liau_index_textstat,
automated_readability_index_textstat,
dale_chall_readability_score_textstat,
difficult_words_textstat,
linsear_write_formula_textstat,
text_standard_textstat,
spache_readability_textstat,
mcalpine_eflaw_textstat,
reading_time_textstat,
}

def validate_scores(x):
if x.startswith("[") and x.endswith("]"):
scores = ast.literal_eval(x)
if not all(score in valid_values for score in scores):
raise argparse.ArgumentTypeError(f"Invalid scores in list. Allowed scores: {valid_values}")
return scores
elif x in valid_values:
return x
else:
raise argparse.ArgumentTypeError(f"Invalid score: {x}. Allowed scores: {valid_values}")

parser.add_argument(
f"--{contents_column_name_cli_param}",
type=str,
required=False,
default=contents_column_name_default,
help="contents column name for input parquet table to transform",
)

parser.add_argument(
f"--{curriculum_cli_param}",
type=lambda x: bool(str2bool(x)),
f"--{score_list_cli_param}",
type=validate_scores,
required=False,
default=curriculum_default,
help="curriculum parameter for transform; select True for curriculum learning",
default=score_list_default,
help=f"list of readability scores to be computed by the transform; valid values: {valid_values}",
)

def apply_input_params(self, args: Namespace) -> bool:
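
The `validate_scores` helper added above accepts either a single score name or a Python-style list literal. The following standalone sketch (a re-implementation for illustration, with an abbreviated set of valid values) shows both forms being parsed:

<pre>
import argparse
import ast

# Abbreviated subset of the valid score names, for illustration only.
valid_values = {"reading_time_textstat", "spache_readability_textstat", "text_standard_textstat"}


def validate_scores(x):
    # A list literal such as "['reading_time_textstat','text_standard_textstat']" ...
    if x.startswith("[") and x.endswith("]"):
        scores = ast.literal_eval(x)
        if not all(score in valid_values for score in scores):
            raise argparse.ArgumentTypeError(f"Invalid scores in list. Allowed scores: {valid_values}")
        return scores
    # ... or a single score name.
    elif x in valid_values:
        return x
    raise argparse.ArgumentTypeError(f"Invalid score: {x}. Allowed scores: {valid_values}")


print(validate_scores("reading_time_textstat"))
# -> 'reading_time_textstat'
print(validate_scores("['reading_time_textstat','text_standard_textstat']"))
# -> ['reading_time_textstat', 'text_standard_textstat']
</pre>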
