Merge pull request #1026 from IBM/perf-readability
Readability transform: performance improvement and adding score_list argument
touma-I authored Feb 11, 2025
2 parents e3ce06e + 5c19c86 commit d991412
Showing 14 changed files with 942 additions and 6,394 deletions.
43 changes: 25 additions & 18 deletions transforms/language/readability/README.md
@@ -72,13 +72,19 @@ or English, focusing on the number of miniwords and length of sentences.
The set of dictionary keys holding [ReadabilityTransform](dpk_readability/runtime.py) configuration for values are as follows:

* _readability_contents_column_name_ - specifies the name of the column holding the document text. The default is `text`.
* _readability_curriculum_ - set to `True` when the data is being prepared for curriculum learning; the documents are then annotated with the `flesch_kincaid`, `gunning_fog`, and `automated_readability_index` readability scores, plus the average of these three grade-level scores, to speed up the annotation process.
* _readability_score_list_ - list of readability scores to be computed by the transform;
valid values: `coleman_liau_index_textstat`, `flesch_kincaid_textstat`,
`difficult_words_textstat`, `spache_readability_textstat`, `smog_index_textstat`,
`reading_time_textstat`, `dale_chall_readability_score_textstat`, `text_standard_textstat`,
`automated_readability_index_textstat`, `gunning_fog_textstat`, `flesch_ease_textstat`,
`mcalpine_eflaw_textstat`, `linsear_write_formula_textstat`.
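
The `*_textstat` suffix in these score names suggests they are computed with the [textstat](https://pypi.org/project/textstat/) package. As a minimal illustration (not part of the transform, and the exact mapping of score names to textstat functions is an assumption based on the names), a few of these metrics can be computed directly on a text string:

<pre>
import textstat  # assumes the textstat package is installed: pip install textstat

sample = "The quick brown fox jumps over the lazy dog. It was a bright cold day in April."

# Each name below is the assumed textstat counterpart of a transform score.
print("flesch_ease:", textstat.flesch_reading_ease(sample))
print("flesch_kincaid:", textstat.flesch_kincaid_grade(sample))
print("gunning_fog:", textstat.gunning_fog(sample))
print("reading_time:", textstat.reading_time(sample))
print("text_standard:", textstat.text_standard(sample))
</pre>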


Additionally, a set of data access-specific arguments is provided that allows
the location of domain list files to be specified, so that these
files can be stored in the local file system or in S3 storage, for example.
The arguments are as follows (and generally match the TransformLauncher's
data access arguments but with the `extreme_tokenized_` prefix).
data access arguments but with the `readability_` prefix).

* _readability_local_config_ - specifies the input and output folders.
* _readability_s3_config_ - specifies the input and output paths in s3.
@@ -94,20 +100,20 @@ annotated `readability-test.parquet` file and the `metadata.json` file.
<pre>
cma:readability$ make venv PYTHON=python3.11
cma:readability$ source venv/bin/activate
(venv) cma:readability$ python -m dpk_readability.runtime --data_local_config "{ 'input_folder': 'test-data/input', 'output_folder': 'output' }"
12:07:23 INFO - Launching Readability transform
12:07:23 INFO - Readability parameters are : {'readability_contents_column_name': 'contents', 'readability_curriculum': False}
12:07:23 INFO - pipeline id pipeline_id
12:07:23 INFO - code location None
12:07:23 INFO - data factory data_ is using local data access: input_folder - test-data/input output_folder - output
12:07:23 INFO - data factory data_ max_files -1, n_sample -1
12:07:23 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
12:07:23 INFO - orchestrator readability started at 2025-01-28 12:07:23
12:07:23 INFO - Number of files is 1, source profile {'max_file_size': 0.014194488525390625, 'min_file_size': 0.014194488525390625, 'total_file_size': 0.014194488525390625}
12:07:23 INFO - Completed 1 files (100.0%) in 0.002 min
12:07:23 INFO - Done processing 1 files, waiting for flush() completion.
12:07:23 INFO - done flushing in 0.0 sec
12:07:23 INFO - Completed execution in 0.003 min, execution result 0
(venv) cma:readability$ python -m dpk_readability.runtime --data_local_config "{ 'input_folder': 'test-data/input', 'output_folder': 'output' }" --readability_score_list "['reading_time_textstat','spache_readability_textstat','text_standard_textstat']"
13:07:23 INFO - Launching Readability transform
13:07:23 INFO - Readability parameters are : {'readability_contents_column_name': 'contents', 'readability_score_list': ['reading_time_textstat', 'spache_readability_textstat', 'text_standard_textstat']}
13:07:23 INFO - pipeline id pipeline_id
13:07:23 INFO - code location None
13:07:23 INFO - data factory data_ is using local data access: input_folder - test-data/input output_folder - output
13:07:23 INFO - data factory data_ max_files -1, n_sample -1
13:07:23 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
13:07:23 INFO - orchestrator readability started at 2025-02-07 13:07:23
13:07:23 INFO - Number of files is 1, source profile {'max_file_size': 0.014194488525390625, 'min_file_size': 0.014194488525390625, 'total_file_size': 0.014194488525390625}
13:07:24 INFO - Completed 1 files (100.0%) in 0.002 min
13:07:24 INFO - Done processing 1 files, waiting for flush() completion.
13:07:24 INFO - done flushing in 0.0 sec
13:07:24 INFO - Completed execution in 0.002 min, execution result 0
(venv) cma:readability$ deactivate
</pre>
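
The same run can also be scripted from Python by shelling out to the CLI shown above; this is just a convenience sketch (the paths and score list are copied from the example and assume the virtual environment with `dpk_readability` installed is active):

<pre>
import subprocess
import sys

# Re-run the second CLI example above from Python.
cmd = [
    sys.executable, "-m", "dpk_readability.runtime",
    "--data_local_config", "{ 'input_folder': 'test-data/input', 'output_folder': 'output' }",
    "--readability_score_list", "['reading_time_textstat','spache_readability_textstat','text_standard_textstat']",
]
subprocess.run(cmd, check=True)
</pre>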

@@ -134,8 +140,8 @@ options:
-h, --help show this help message and exit
--readability_contents_column_name READABILITY_CONTENTS_COLUMN_NAME
contents column name for input parquet table to transform
--readability_curriculum READABILITY_CURRICULUM
curriculum parameter for transform; select True for curriculum learning
--readability_score_list READABILITY_SCORE_LIST
list of readability scores to be computed by the transform; valid values: {'flesch_ease_textstat', 'reading_time_textstat', 'flesch_kincaid_textstat', 'automated_readability_index_textstat', 'linsear_write_formula_textstat', 'text_standard_textstat', 'smog_index_textstat', 'difficult_words_textstat', 'spache_readability_textstat', 'dale_chall_readability_score_textstat', 'mcalpine_eflaw_textstat', 'gunning_fog_textstat', 'coleman_liau_index_textstat'}
--data_s3_cred DATA_S3_CRED
AST string of options for s3 credentials. Only required for S3 data access.
access_key: access key help text
@@ -181,3 +187,4 @@ options:
path: Path within the repository
Example: { 'github': 'https://github.com/somerepo', 'commit_hash': '1324',
'path': 'transforms/universal/code' }

14 changes: 6 additions & 8 deletions transforms/language/readability/dpk_readability/common.py
@@ -63,25 +63,23 @@
"""Key holds the mcalpine_eflaw_textstat R score threshold parameter"""
reading_time_textstat = "reading_time_textstat"
"""Key holds the reading_time_textstat R score threshold parameter"""
avg_grade_level = "avg_grade_level"
"""Key holds the avg_grade_level R score threshold parameter"""
contents_column_name = "contents_column_name"
"""Contents column name for the input parquet table to the transform"""
curriculum = "curriculum"
"""curriculum parameter for transform; either True or False"""
score_list = "score_list"
"""list of readability scores to be computed by the transform"""


########################################################################################
# CLI parameters corresponding to each config key
"""avg_grade_level R score threshold parameter"""
contents_column_name_cli_param = f"{cli_prefix}{contents_column_name}"
"""Content column name for parquet input table to transform"""
curriculum_cli_param = f"{cli_prefix}{curriculum}"
"""curriculum parameter for transform; either True or False"""
score_list_cli_param = f"{cli_prefix}{score_list}"
"""list of readability scores or a single readability scores to be computed by the transform"""


# The set of default values that can be overwritten from the CLI
contents_column_name_default = "contents"
"""The default value for contents_column_name"""
curriculum_default = False
"""curriculum parameter for transform; either True or False"""
score_list_default = mcalpine_eflaw_textstat
"""readability score that is computed by default"""
55 changes: 49 additions & 6 deletions transforms/language/readability/dpk_readability/runtime.py
@@ -10,6 +10,8 @@
# limitations under the License.
################################################################################

import argparse
import ast
import sys
from argparse import ArgumentParser, Namespace

@@ -21,12 +23,25 @@
from data_processing.transform import TransformConfiguration
from data_processing.utils import CLIArgumentProvider, ParamsUtils, get_logger, str2bool
from dpk_readability.common import (
automated_readability_index_textstat,
cli_prefix,
coleman_liau_index_textstat,
contents_column_name_cli_param,
contents_column_name_default,
curriculum_cli_param,
curriculum_default,
dale_chall_readability_score_textstat,
difficult_words_textstat,
flesch_ease_textstat,
flesch_kincaid_textstat,
gunning_fog_textstat,
linsear_write_formula_textstat,
mcalpine_eflaw_textstat,
reading_time_textstat,
score_list_cli_param,
score_list_default,
short_name,
smog_index_textstat,
spache_readability_textstat,
text_standard_textstat,
)
from dpk_readability.transform import ReadabilityTransform

@@ -54,19 +69,47 @@ def add_input_params(self, parser: ArgumentParser) -> None:
By convention a common prefix should be used for all transform-specific CLI args
(e.g., noop_, pii_, etc.)
"""
valid_values = {
flesch_ease_textstat,
flesch_kincaid_textstat,
gunning_fog_textstat,
smog_index_textstat,
coleman_liau_index_textstat,
automated_readability_index_textstat,
dale_chall_readability_score_textstat,
difficult_words_textstat,
linsear_write_formula_textstat,
text_standard_textstat,
spache_readability_textstat,
mcalpine_eflaw_textstat,
reading_time_textstat,
}

def validate_scores(x):
if x.startswith("[") and x.endswith("]"):
scores = ast.literal_eval(x)
if not all(score in valid_values for score in scores):
raise argparse.ArgumentTypeError(f"Invalid scores in list. Allowed scores: {valid_values}")
return scores
elif x in valid_values:
return x
else:
raise argparse.ArgumentTypeError(f"Invalid score: {x}. Allowed scores: {valid_values}")

parser.add_argument(
f"--{contents_column_name_cli_param}",
type=str,
required=False,
default=contents_column_name_default,
help="contents column name for input parquet table to transform",
)

parser.add_argument(
f"--{curriculum_cli_param}",
type=lambda x: bool(str2bool(x)),
f"--{score_list_cli_param}",
type=validate_scores,
required=False,
default=curriculum_default,
help="curriculum parameter for transform; select True for curriculum learning",
default=score_list_default,
help=f"list of readability scores to be computed by the transform; valid values: {valid_values}",
)

def apply_input_params(self, args: Namespace) -> bool:
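
The `validate_scores` helper added above accepts either a single score name or a Python-style list literal. The following standalone sketch (a re-implementation for illustration, with an abbreviated set of valid values) shows both forms being parsed:

<pre>
import argparse
import ast

# Abbreviated subset of the valid score names, for illustration only.
valid_values = {"reading_time_textstat", "spache_readability_textstat", "text_standard_textstat"}


def validate_scores(x):
    # A list literal such as "['reading_time_textstat','text_standard_textstat']" ...
    if x.startswith("[") and x.endswith("]"):
        scores = ast.literal_eval(x)
        if not all(score in valid_values for score in scores):
            raise argparse.ArgumentTypeError(f"Invalid scores in list. Allowed scores: {valid_values}")
        return scores
    # ... or a single score name.
    elif x in valid_values:
        return x
    raise argparse.ArgumentTypeError(f"Invalid score: {x}. Allowed scores: {valid_values}")


print(validate_scores("reading_time_textstat"))
# -> 'reading_time_textstat'
print(validate_scores("['reading_time_textstat','text_standard_textstat']"))
# -> ['reading_time_textstat', 'text_standard_textstat']
</pre>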
