From 908e5a674d2a3b4049004e62762c944d644dc14b Mon Sep 17 00:00:00 2001
From: Saif Addin
Date: Mon, 20 Aug 2018 13:19:04 -0300
Subject: [PATCH 1/2] SentenceDetection improvements

---
 CHANGELOG                                   | 44 +++++++++++++++++++
 README.md                                   | 30 ++++++-------
 build.sbt                                   |  4 +-
 docs/index.html                             |  4 +-
 docs/notebooks.html                         | 18 ++++----
 docs/quickstart.html                        | 24 +++++-----
 python/setup.py                             |  2 +-
 .../scala/com/johnsnowlabs/util/Build.scala |  2 +-
 .../PragmaticDetectionPerfTest.scala        |  2 +-
 9 files changed, 87 insertions(+), 43 deletions(-)

diff --git a/CHANGELOG b/CHANGELOG
index ed3101a70806c6..cf77977927690c 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,3 +1,47 @@
+========
+1.6.2
+========
+---------------
+Overview
+---------------
+In this release, we focused on reviewing our streaming performance by measuring the number of sentences processed per second through a LightPipeline.
+We made the Norvig Spell Checker more than 300% faster by disabling DoubleVariants and improving algorithmic complexity. It is now reported to process 42K sentences per second.
+The Symmetric Delete Spell Checker also performs better, and has been reported to process 2K sentences per second.
+NerCRF has been reported to process 300 sentences per second, while NerDL runs about twice as fast (about 700 sentences per second).
+Vivekn Sentiment Analysis was improved and is now capable of processing 100K sentences per second (previously below 500).
+Finally, SentenceDetector performance was improved by 40%, from ~30K rows processed per second to ~40K. However, abbreviation processing is now enabled by default, which reduces the final speed to 22K rows per second: a net decrease, but with better accuracy.
+Again, thanks to the community for helping with feedback. We welcome everyone to ask questions or give feedback in our Slack channel, or to report issues on Github.
+
+---------------
+Enhancements
+---------------
+* OCR now features kernel segmentation, which significantly improves image-based PDF processing
+* Vivekn Sentiment Analysis prediction performance improved through better data structures
+* Both the Norvig and Symmetric Delete spell checkers now have improved performance
+* SentenceDetector accuracy improved by better handling of abbreviations. UseAbbreviations is now also turned ON by default
+* SentenceDetector performance improved significantly through better preloading of rules
+
+---------------
+Bug fixes
+---------------
+* Fixed NerDL not training correctly (broken since 1.6.0). Pretrained models are not affected
+* Fixed NerConverter not properly considering multiple sentences per row (after using SentenceDetector), which caused an unhandled exception in some scenarios
+* TensorFlow sessions now all set allow_soft_placement, allowing GPU-based graphs to work both with and without a GPU
+* Norvig Spell Checker: fixed a missing step of the algorithm that checks for additional variants. May improve accuracy
+* Norvig Spell Checker: disabled DoubleVariants by default. It was not improving accuracy significantly and was hurting performance badly
+
+---------------
+Developer API
+---------------
+* New FeatureSet allows HashSet params
+
+---------------
+Models
+---------------
+* The Vivekn Sentiment Pipeline no longer includes a Spell Checker
+* Fixed the Vivekn Sentiment pretrained model, improving its accuracy
+
+
 ========
 1.6.1
 ========
diff --git a/README.md b/README.md
index cef6ad97ba4386..a70426d2d30d31 100644
--- a/README.md
+++ b/README.md
@@ -14,18 +14,18 @@ Questions? Feedback?
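The throughput figures in the changelog above were collected by pushing batches of rows through a LightPipeline. Below is a minimal sketch of such a measurement, assuming the 1.6.x Python API layout (`sparknlp.base`, `sparknlp.annotator`), a running pyspark shell where `spark` already exists, and a made-up workload; the `setUseAbbreviations` setter is assumed from the `useAbbreviations` param named in the changelog:

```python
import time
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import SentenceDetector

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setUseAbbreviations(True)  # assumed setter; abbreviation handling is ON by default in 1.6.2

# These stages need no training data, so fitting on a one-row DataFrame is enough
data = spark.createDataFrame([["Dr. Smith arrived at 8 a.m. He left soon after."]]).toDF("text")
model = Pipeline(stages=[documentAssembler, sentenceDetector]).fit(data)
light = LightPipeline(model)

rows = ["Dr. Smith arrived at 8 a.m. He left soon after."] * 10000  # hypothetical workload
start = time.time()
light.annotate(rows)
print("rows per second: %.0f" % (len(rows) / (time.time() - start)))
```

Setting `setUseAbbreviations(False)` on the detector should reproduce the faster but less accurate figure quoted above.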
 Request access by sending an email to nlp@johnsnowlabs.com
 
 This library has been uploaded to the spark-packages repository https://spark-packages.org/package/JohnSnowLabs/spark-nlp .
 
-To use the most recent version just add the `--packages JohnSnowLabs:spark-nlp:1.6.1` to you spark command
+To use the most recent version just add the `--packages JohnSnowLabs:spark-nlp:1.6.2` to your spark command
 
 ```sh
-spark-shell --packages JohnSnowLabs:spark-nlp:1.6.1
+spark-shell --packages JohnSnowLabs:spark-nlp:1.6.2
 ```
 
 ```sh
-pyspark --packages JohnSnowLabs:spark-nlp:1.6.1
+pyspark --packages JohnSnowLabs:spark-nlp:1.6.2
 ```
 
 ```sh
-spark-submit --packages JohnSnowLabs:spark-nlp:1.6.1
+spark-submit --packages JohnSnowLabs:spark-nlp:1.6.2
 ```
 
 ## Jupyter Notebook
 
 ```sh
 export SPARK_HOME=/path/to/your/spark/folder
 export PYSPARK_DRIVER_PYTHON=jupyter
 export PYSPARK_DRIVER_PYTHON_OPTS=notebook
 
-pyspark --packages JohnSnowLabs:spark-nlp:1.6.1
+pyspark --packages JohnSnowLabs:spark-nlp:1.6.2
 ```
 
 ## Apache Zeppelin
 This way will work for both Scala and Python
 ```
-export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:1.6.1"
+export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:1.6.2"
 ```
 Alternatively, add the following Maven Coordinates to the interpreter's library list
 ```
-com.johnsnowlabs.nlp:spark-nlp_2.11:1.6.1
+com.johnsnowlabs.nlp:spark-nlp_2.11:1.6.2
 ```
 
 ## Python without explicit Spark installation
 If you installed pyspark through pip, you can now install sparknlp through pip
 ```
-pip install --index-url https://test.pypi.org/simple/ spark-nlp==1.6.1
+pip install --index-url https://test.pypi.org/simple/ spark-nlp==1.6.2
 ```
 Then you'll have to create a SparkSession manually, for example:
 ```
@@ -84,11 +84,11 @@ sparknlp {
 
 ## Pre-compiled Spark-NLP and Spark-NLP-OCR
 You may download fat-jar from here:
-[Spark-NLP 1.6.1 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/spark-nlp-assembly-1.6.1.jar)
+[Spark-NLP 1.6.2 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/spark-nlp-assembly-1.6.2.jar)
 
 or non-fat from here
-[Spark-NLP 1.6.1 PKG JAR](http://repo1.maven.org/maven2/com/johnsnowlabs/nlp/spark-nlp_2.11/1.6.1/spark-nlp_2.11-1.6.1.jar)
+[Spark-NLP 1.6.2 PKG JAR](http://repo1.maven.org/maven2/com/johnsnowlabs/nlp/spark-nlp_2.11/1.6.2/spark-nlp_2.11-1.6.2.jar)
 
 Spark-NLP-OCR Module (Requires native Tesseract 4.x+ for image based OCR. Does not require Spark-NLP to work but highly suggested)
-[Spark-NLP-OCR 1.6.1 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/spark-nlp-ocr-assembly-1.6.1.jar)
+[Spark-NLP-OCR 1.6.2 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/spark-nlp-ocr-assembly-1.6.2.jar)
 
 ## Maven central
 
@@ -100,19 +100,19 @@
 Our package is deployed to maven central. In order to add this package as a dependency in your application:
 
 <dependency>
     <groupId>com.johnsnowlabs.nlp</groupId>
     <artifactId>spark-nlp_2.11</artifactId>
-    <version>1.6.1</version>
+    <version>1.6.2</version>
 </dependency>
 
 #### SBT
 ```sbtshell
-libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.6.1"
+libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.6.2"
 ```
 
 If you are using `scala 2.11`
 ```sbtshell
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.6.1"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.6.2"
 ```
 
 ## Using the jar manually
@@ -133,7 +133,7 @@
 The preferred way to use the library when running spark programs is using the `--packages` option.
 
 If you have troubles using pretrained() models in your environment, here is a list of various models (only valid for latest versions).
 If a model is older than the current version, it means it still works for current versions.
-### Updated for 1.6.1
+### Updated for 1.6.2
 ### Pipelines
 * [Basic Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_basic_en_1.6.1_2_1533856444797.zip)
 * [Advanced Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_advanced_en_1.6.1_2_1533856478690.zip)
diff --git a/build.sbt b/build.sbt
index 60247e81b18ea5..959abad3c63177 100644
--- a/build.sbt
+++ b/build.sbt
@@ -9,7 +9,7 @@
 name := "spark-nlp"
 
 organization := "com.johnsnowlabs.nlp"
 
-version := "1.6.1"
+version := "1.6.2"
 
 scalaVersion in ThisBuild := scalaVer
 
@@ -138,7 +138,7 @@ assemblyMergeStrategy in assembly := {
 lazy val ocr = (project in file("ocr"))
   .settings(
     name := "spark-nlp-ocr",
-    version := "1.6.1",
+    version := "1.6.2",
 
     libraryDependencies ++=
       ocrDependencies ++
         analyticsDependencies ++
         testDependencies,
diff --git a/docs/index.html b/docs/index.html
index 7075d9a7191c91..fdb2ada03a2e55 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -78,8 +78,8 @@

High Performance NLP with Apache Spark

Questions? Join our Slack

-

2018 Aug 9th - Update! 1.6.1 Released! Fixed S3-based clusters support, new CHUNK type annotation and more!
-Learn changes HERE and check out for updated documentation below

+

2018 Aug 20th - Update! 1.6.2 Released! Annotation performance revisited! Check our changelog
+Learn about the changes HERE and check out the updated documentation below

diff --git a/docs/notebooks.html b/docs/notebooks.html index 384797b1a3bb52..f83cfc671bce01 100644 --- a/docs/notebooks.html +++ b/docs/notebooks.html @@ -103,7 +103,7 @@

Vivekn Sentiment Analysis. Since we are dealing with small amounts of data, we put LightPipelines into practice.

-Take me to notebook!
+Take me to notebook!

@@ -135,7 +135,7 @@

Vivekn Sentiment Analysis

better Sentiment Analysis accuracy

-Take me to notebook!
+Take me to notebook!

@@ -157,7 +157,7 @@

Rule-based Sentiment Analysis. Each of these sentences will be used to give a score to the text.

-Take me to notebook!
+Take me to notebook!

@@ -177,7 +177,7 @@

CRF Named Entity Recognition

-Take me to notebook!
+Take me to notebook!

@@ -196,7 +196,7 @@

CNN Deep Learning NER

and it will leverage batch-based distributed calls to native TensorFlow libraries during prediction.

-Take me to notebook!
+Take me to notebook!

@@ -211,7 +211,7 @@

Simple Text Matching

This annotator is an AnnotatorModel and does not require training.

-Take me to notebook!
+Take me to notebook!

@@ -226,7 +226,7 @@

Assertion Status with LogReg. The dataset will return the appropriate result.

-Take me to notebook!
+Take me to notebook!

@@ -241,7 +241,7 @@

Deep Learning Assertion Status. Graphs may be redesigned if needed.

-Take me to notebook!
+Take me to notebook!

@@ -260,7 +260,7 @@

Retrieving Pretrained models. Such components may then be injected seamlessly into further pipelines, and so on.

-Take me to notebook!
+Take me to notebook!

diff --git a/docs/quickstart.html b/docs/quickstart.html index 95bc2e6b2422b8..cde07a225cce6b 100644 --- a/docs/quickstart.html +++ b/docs/quickstart.html @@ -95,35 +95,35 @@

Requirements & Setup

To start using the library, execute any of the following lines depending on your desired use case:

-spark-shell --packages JohnSnowLabs:spark-nlp:1.6.1
-pyspark --packages JohnSnowLabs:spark-nlp:1.6.1
-spark-submit --packages JohnSnowLabs:spark-nlp:1.6.1
+spark-shell --packages JohnSnowLabs:spark-nlp:1.6.2
+pyspark --packages JohnSnowLabs:spark-nlp:1.6.2
+spark-submit --packages JohnSnowLabs:spark-nlp:1.6.2
 
NOTE: The Spark --packages option has been reported to work improperly, particularly in Python, when utilizing physical clusters. Using --jars is advised. For Python, add Spark-NLP through pip.
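When the session is built from inside a Python program rather than a launcher command, the `spark.jars` option plays the role of the advised --jars flag. A minimal sketch, with a hypothetical local path to the fat-jar:

```python
from pyspark.sql import SparkSession

# spark.jars ships the jar to the driver and executors, like --jars on the command line
spark = SparkSession.builder \
    .appName("spark-nlp") \
    .master("local[*]") \
    .config("spark.jars", "/path/to/spark-nlp-assembly-1.6.2.jar") \
    .getOrCreate()
```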

Databricks cloud cluster & Apache Zeppelin

-com.johnsnowlabs.nlp:spark-nlp_2.11:1.6.1
+com.johnsnowlabs.nlp:spark-nlp_2.11:1.6.2

For Python in Apache Zeppelin, you may need to set up SPARK_SUBMIT_OPTIONS with the --packages instruction shown above, like this:

-export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:1.6.1"
+export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:1.6.2"

Python Jupyter Notebook with PySpark

export SPARK_HOME=/path/to/your/spark/folder
 export PYSPARK_DRIVER_PYTHON=jupyter
 export PYSPARK_DRIVER_PYTHON_OPTS=notebook
 
-pyspark --packages JohnSnowLabs:spark-nlp:1.6.1
+pyspark --packages JohnSnowLabs:spark-nlp:1.6.2

Python without explicit Spark Installation

Use pip to install (after you have pip-installed pyspark)

-pip install --index-url https://test.pypi.org/simple/ spark-nlp==1.6.1
+pip install --index-url https://test.pypi.org/simple/ spark-nlp==1.6.2

In this way, you will have to start the SparkSession in your Python program manually; here is an example:

spark = SparkSession.builder \
     .appName("ner")\
     .master("local[*]")\
     .config("spark.driver.memory","4G")\
     .config("spark.driver.maxResultSize", "2G") \
-    .config("spark.driver.extraClassPath", "lib/spark-nlp-assembly-1.6.1.jar")\
+    .config("spark.driver.extraClassPath", "lib/spark-nlp-assembly-1.6.2.jar")\
     .config("spark.kryoserializer.buffer.max", "500m")\
     .getOrCreate()
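Once that session is up, a short smoke test confirms that Spark actually picked up the jar from extraClassPath. A sketch assuming the 1.6.x Python API layout (`sparknlp.base`, `sparknlp.annotator`) and the `spark` variable created above:

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer

data = spark.createDataFrame([["This is a sentence. This is another one."]]).toDF("text")

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

# If the jar is missing from the class path, this fails immediately on the Java side
Pipeline(stages=[document, sentence, token]) \
    .fit(data) \
    .transform(data) \
    .select("sentence.result") \
    .show(truncate=False)
```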

S3 based standalone cluster (No Hadoop)

@@ -145,11 +145,11 @@

S3 based standalone cluster (No Hadoop)

Pre-Compiled Spark-NLP for download

Pre-compiled Spark-NLP assembly fat-jar for using in standalone projects, may be downloaded
-here
+here
Non-fat-jar may be downloaded
-here
+here
then, run spark-shell or spark-submit with appropriate --jars
-/path/to/spark-nlp_2.11-1.6.1.jar to use the library in spark.
+/path/to/spark-nlp_2.11-1.6.2.jar to use the library in spark.

For further alternatives and documentation, check out our README page on GitHub.
@@ -435,7 +435,7 @@

Utilizing Spark-NLP OCR PDF Converter

Installing Spark-NLP OCRHelper

First, either build from source or download the following standalone jar module (works both from Spark-NLP python and scala):
-Spark-NLP-OCR
+Spark-NLP-OCR
And add it to your Spark environment (with --jars or spark.driver.extraClassPath and spark.executor.extraClassPath configuration)
Second, if your PDFs don't have a text layer (this depends on how PDFs were created), the library will use Tesseract 4.0 in the background. Tesseract will utilize native libraries, so you'll have to get them installed on your system.
diff --git a/python/setup.py b/python/setup.py
index 3ba288838de207..f410866df6268c 100644
--- a/python/setup.py
+++ b/python/setup.py
@@ -40,7 +40,7 @@
     # For a discussion on single-sourcing the version across setup.py and the
     # project code, see
     # https://packaging.python.org/en/latest/single_source_version.html
-    version='1.6.1',  # Required
+    version='1.6.2',  # Required
 
     # This is a one-line description or tagline of what your project does. This
     # corresponds to the "Summary" metadata field:
diff --git a/src/main/scala/com/johnsnowlabs/util/Build.scala b/src/main/scala/com/johnsnowlabs/util/Build.scala
index 9acdbdefd5829d..0c941c6053ca1d 100644
--- a/src/main/scala/com/johnsnowlabs/util/Build.scala
+++ b/src/main/scala/com/johnsnowlabs/util/Build.scala
@@ -11,6 +11,6 @@ object Build {
     if (version != null && version.nonEmpty)
       version
     else
-      "1.6.1"
+      "1.6.2"
   }
 }
\ No newline at end of file
diff --git a/src/test/scala/com/johnsnowlabs/nlp/annotators/sbd/pragmatic/PragmaticDetectionPerfTest.scala b/src/test/scala/com/johnsnowlabs/nlp/annotators/sbd/pragmatic/PragmaticDetectionPerfTest.scala
index fb869cc7b60cc2..b32955e785d76b 100644
--- a/src/test/scala/com/johnsnowlabs/nlp/annotators/sbd/pragmatic/PragmaticDetectionPerfTest.scala
+++ b/src/test/scala/com/johnsnowlabs/nlp/annotators/sbd/pragmatic/PragmaticDetectionPerfTest.scala
@@ -8,7 +8,7 @@ import org.scalatest._
 
 class PragmaticDetectionPerfTest extends FlatSpec {
 
-  "sentence detection" should "be fast" ignore {
+  "sentence detection" should "be fast" in {
 
     ResourceHelper.spark
     import ResourceHelper.spark.implicits._

From 0a55b3fe8f05376429aec7210966b87113f6f9a9 Mon Sep 17 00:00:00 2001
From: Saif Addin
Date: Mon, 20 Aug 2018 13:36:54 -0300
Subject: [PATCH 2/2] Updated model links

---
 README.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index a70426d2d30d31..bc3f2bad7de284 100644
--- a/README.md
+++ b/README.md
@@ -136,14 +136,14 @@ If there is any older than current version of a model, it means they still work
 ### Updated for 1.6.2
 ### Pipelines
 * [Basic Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_basic_en_1.6.1_2_1533856444797.zip)
-* [Advanced Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_advanced_en_1.6.1_2_1533856478690.zip)
+* [Advanced Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_advanced_en_1.6.2_2_1534781366259.zip)
+* [Vivekn Sentiment Pipeline](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pipeline_vivekn_en_1.6.2_2_1534781342094.zip)
 ### Models
 * [PerceptronModel (POS)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/pos_fast_en_1.6.1_2_1533853928168.zip)
-* [ViveknSentimentModel (Sentiment)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/vivekn_fast_en_1.6.1_2_1533942419063.zip)
-* [SymmetricDeleteModel (Spell Checker)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spell_sd_fast_en_1.6.1_2_1533854712643.zip)
-* [NorvigSweetingModel (Spell Checker)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spell_fast_en_1.6.1_2_1533854544551.zip)
+* [ViveknSentimentModel (Sentiment)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/vivekn_fast_en_1.6.2_2_1534781337758.zip)
+* [SymmetricDeleteModel (Spell Checker)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spell_sd_fast_en_1.6.2_2_1534781178138.zip)
+* [NorvigSweetingModel (Spell Checker)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/spell_fast_en_1.6.2_2_1534781328404.zip)
 * [AssertionDLModel (Assertion Status)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/as_fast_dl_en_1.6.1_2_1533855787457.zip)
 * [NerCRFModel (NER)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_fast_en_1.6.1_2_1533854463219.zip)
 * [LemmatizerModel (Lemmatizer)](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/lemma_fast_en_1.6.1_2_1533854538211.zip)
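If pretrained() cannot reach the public S3 bucket from your environment, the zips listed above can be downloaded, extracted, and loaded directly, since each one is expected to unpack into a standard Spark ML PipelineModel directory. A sketch with a hypothetical local path:

```python
from pyspark.ml import PipelineModel

# Hypothetical: the Basic Pipeline zip from the list above, downloaded and unzipped locally
pipeline = PipelineModel.load("file:///tmp/pipeline_basic_en_1.6.1_2_1533856444797")

data = spark.createDataFrame([["Offline loading avoids the pretrained() download step."]]).toDF("text")
pipeline.transform(data).show()
```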