Merge pull request #258 from JohnSnowLabs/161-release-candidate
1.6.1 release candidate
saif-ellafi authored Aug 9, 2018
2 parents 594342b + 4f6bd68 commit 2e9808a
Showing 16 changed files with 144 additions and 244 deletions.
51 changes: 51 additions & 0 deletions CHANGELOG
@@ -1,3 +1,54 @@
========
1.6.1
========
---------------
Overview
---------------
Hi! We're glad to announce hotfix release 1.6.1. Although the changes may seem modest or very specific, there is a lot going on under the hood. First of all, we've worked hard with the community to understand S3-based clusters,
which often lack a common fs.defaultFS configuration, the setting we use to locate the cluster temp folder where word embeddings are distributed. We fixed two things here:
first, we fixed a bug that pointed to the wrong filesystem; second, we added a custom override setting in application.conf that allows manually setting where temp folders are placed in the cluster. This should help S3 users.
Please share your feedback in this regard.
We also created a new annotator type internally. The CHUNK type allows better modularity in the communication between different annotators. Its impact will be noticed implicitly and over time.
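
The override behaviour described above can be pictured with a small plain-Python sketch. It is not the library's code: the function name `resolve_cluster_tmp_dir`, the dictionary standing in for application.conf, and the `/tmp` suffix used as a fallback are all assumptions made for illustration. Only the key name `sparknlp.settings.cluster_tmp_dir` comes from this changelog.

```python
from typing import Mapping, Optional

# Key name taken from this changelog; everything else here is hypothetical.
OVERRIDE_KEY = "sparknlp.settings.cluster_tmp_dir"


def resolve_cluster_tmp_dir(settings: Mapping[str, str],
                            default_fs: Optional[str]) -> str:
    """Prefer the explicit override, otherwise fall back to fs.defaultFS."""
    override = settings.get(OVERRIDE_KEY, "").strip()
    if override:
        return override
    if default_fs:
        # Assumed fallback location, chosen only for this sketch.
        return default_fs.rstrip("/") + "/tmp"
    raise ValueError("no override set and fs.defaultFS is not configured")


# An S3 user sets the override, so fs.defaultFS is ignored.
s3_dir = resolve_cluster_tmp_dir({OVERRIDE_KEY: "s3://my-bucket/tmp"}, "file:///")
# Without an override, the default filesystem is used.
hdfs_dir = resolve_cluster_tmp_dir({}, "hdfs://namenode:8020/")
```

The point of the sketch is the order of precedence: an explicit setting wins, and the filesystem default is only consulted when no override is present.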

---------------
New features
---------------
* New Scala-only functions that make it easier to work with Annotations in DataFrames. They may be imported through com.johnsnowlabs.nlp.functions._ and allow mapping and filtering within and outside Annotations.
filterByAnnotations, mapAnnotations and explodeAnnotations work by providing a column and a function. Check out the documentation. They may come to Python later.
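
As a rough illustration of the "column plus function" idea, here is a plain-Python model of the three operations. This is not the Scala API: the snake_case function names, the `Annotation` dataclass fields and the use of lists of dicts in place of DataFrames are all assumptions of this sketch, so check the documentation for the real signatures.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Annotation:
    # Hypothetical, simplified stand-in for an annotation record.
    annotator_type: str
    begin: int
    end: int
    result: str


Row = Dict[str, Any]


def filter_by_annotations(rows: List[Row], column: str,
                          predicate: Callable[[List[Annotation]], bool]) -> List[Row]:
    """Keep only rows whose annotation column satisfies the predicate."""
    return [row for row in rows if predicate(row[column])]


def map_annotations(rows: List[Row], column: str, output_col: str,
                    function: Callable[[List[Annotation]], Any]) -> List[Row]:
    """Add a new column computed from each row's annotations."""
    return [{**row, output_col: function(row[column])} for row in rows]


def explode_annotations(rows: List[Row], column: str, output_col: str) -> List[Row]:
    """Emit one output row per annotation found in the column."""
    return [{**row, output_col: ann} for row in rows for ann in row[column]]


rows = [
    {"id": 1, "token": [Annotation("token", 0, 4, "Spark"),
                        Annotation("token", 6, 8, "NLP")]},
    {"id": 2, "token": []},
]
non_empty = filter_by_annotations(rows, "token", lambda anns: len(anns) > 0)
with_words = map_annotations(non_empty, "token", "words",
                             lambda anns: [a.result for a in anns])
exploded = explode_annotations(non_empty, "token", "one_token")
```

In this model the row with no tokens is dropped by the filter, the map adds a plain list of strings, and the explode step turns one row with two tokens into two rows.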

---------------
Bug fixes
---------------
* Fixed incorrect filesystem readings in some S3 environments for word embeddings
* Fixed NerCRF not correctly training from CONLL, labeling everything as -O- (Thanks @arnound from Slack Channel)

---------------
Enhancements
---------------
* Added overridable config sparknlp.settings.cluster_tmp_dir, which allows setting the cluster location for the temporary embeddings file. May help S3-based clusters where fs.defaultFS is not set to a proper distributed storage.
* New annotator type: CHUNK. Represents a SUBSTRING of DOCUMENT and is used as output from NerConverter, TextMatcher, RegexMatcher and other annotators that retrieve a substring from the original document.
This will make for better modularity and integration among various annotators, such as between NER and AssertionStatus.
* New annotation transformer: ChunkAssembler. Takes a string or array(string) column from a dataset and creates a CHUNK type annotation. The content must also belong to the current DOCUMENT annotation's content.
* SentenceDetector new param explodeSentences allows exploding sentences within a single row into different rows to increase parallelism and performance in some scenarios, particularly OCR-based ones.
* AssertionDLApproach may now be used within LightPipelines
* AssertionDLApproach and AssertionLogRegApproach now work from CHUNK type instead of start/end bounds. They may still be trained with start/end, though. This means the target for assertion may now be any CHUNK output annotator (e.g. RegexMatcher)
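
To make the CHUNK idea concrete, the following plain-Python sketch models a chunk as a substring of a document with offsets, and rejects content that does not belong to the document. It is only a conceptual model: the `Chunk` dataclass, the `assemble_chunks` helper and the choice of inclusive end offsets are assumptions of this sketch, not the ChunkAssembler implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Chunk:
    # Hypothetical model of a CHUNK annotation.
    begin: int   # offset of the first character in the document
    end: int     # offset of the last character (inclusive, by assumption)
    result: str  # the substring itself


def assemble_chunks(document: str, targets: List[str]) -> List[Chunk]:
    """Build chunks from strings, requiring each to occur in the document."""
    chunks = []
    for target in targets:
        if not target:
            raise ValueError("chunk content must not be empty")
        begin = document.find(target)  # first occurrence only, for simplicity
        if begin == -1:
            raise ValueError(f"{target!r} is not part of the document")
        chunks.append(Chunk(begin, begin + len(target) - 1, target))
    return chunks


doc = "Patient denies chest pain."
chunks = assemble_chunks(doc, ["chest pain"])
first = chunks[0]
# The offsets point back into the original document.
recovered = doc[first.begin:first.end + 1]
```

Because a chunk carries both its text and its position, a downstream step such as an assertion model can take its target from any producer of chunks rather than from separate start/end columns.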

---------------
Other
---------------
* PerceptronApproachLegacy moved back to default PerceptronApproach. Distributed PerceptronApproach moved to PerceptronApproachDistributed due to not meeting accuracy expectations yet.
* Some configuration parameters in application.conf have been appropriately moved to proper annotator Params (NorvigSweeting Spell Checker, Vivekn Approach and Sentiment Detector affected)
* application.conf renamed configuration values for better consistency

---------------
Developer API
---------------
* Added beforeAnnotate() and afterAnnotate() to manipulate dataframes before or after calling the annotate() UDF
* Added extraValidate() and extraValidateMsg() in all annotators to allow developers to add additional SCHEMA checks in the transformSchema() stage
* Removed the validation() step from the fit() stage. This allows for more flexible training when some of the columns are not really required yet.
* WrapColumnMetadata() will wrap an Annotation column with its appropriate Metadata. Makes it easier not to forget about Metadata in Schema.
* The RawAnnotator trait now has all the basics needed to start a new Annotator without an annotate() function. It is a complete stage preceding AnnotatorModel, which inherits from RawAnnotator.
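
The hook order described in this list can be sketched as a small template-method class in plain Python. This is a conceptual model only: the class names, the snake_case method names, and the use of lists of dicts instead of Spark DataFrames are assumptions of this sketch and do not mirror the Scala signatures.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


class AnnotatorSketch:
    """Hypothetical base class showing where each hook would run."""
    input_col = "text"
    output_col = "annotations"

    def extra_validate(self, columns: List[str]) -> bool:
        return True  # subclasses may add schema checks here

    def extra_validate_msg(self) -> str:
        return "extra schema validation failed"

    def before_annotate(self, rows: List[Row]) -> List[Row]:
        return rows  # default: leave the data untouched

    def after_annotate(self, rows: List[Row]) -> List[Row]:
        return rows

    def annotate(self, text: str) -> List[str]:
        raise NotImplementedError

    def transform(self, columns: List[str], rows: List[Row]) -> List[Row]:
        # Schema checks first, then the hooks wrap the per-row annotate call.
        if self.input_col not in columns:
            raise ValueError(f"missing input column {self.input_col!r}")
        if not self.extra_validate(columns):
            raise ValueError(self.extra_validate_msg())
        prepared = self.before_annotate(rows)
        annotated = [{**r, self.output_col: self.annotate(r[self.input_col])}
                     for r in prepared]
        return self.after_annotate(annotated)


class WhitespaceTokenizer(AnnotatorSketch):
    def before_annotate(self, rows: List[Row]) -> List[Row]:
        return [r for r in rows if r["text"].strip()]  # drop blank rows

    def annotate(self, text: str) -> List[str]:
        return text.split()

    def after_annotate(self, rows: List[Row]) -> List[Row]:
        return [{**r, "n_tokens": len(r["annotations"])} for r in rows]


result = WhitespaceTokenizer().transform(
    ["text"], [{"text": "Spark NLP rocks"}, {"text": "   "}])
```

Here the blank row is removed before annotation and a token count is added afterwards, which is the kind of dataframe-level work the two hooks are meant to make room for.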

========
1.6.0
========
28 changes: 14 additions & 14 deletions README.md
@@ -14,18 +14,18 @@ Questions? Feedback? Request access sending an email to [email protected]

This library has been uploaded to the spark-packages repository https://spark-packages.org/package/JohnSnowLabs/spark-nlp .

To use the most recent version just add the `--packages JohnSnowLabs:spark-nlp:1.6.0` to you spark command
To use the most recent version just add the `--packages JohnSnowLabs:spark-nlp:1.6.1` to you spark command

```sh
spark-shell --packages JohnSnowLabs:spark-nlp:1.6.0
spark-shell --packages JohnSnowLabs:spark-nlp:1.6.1
```

```sh
pyspark --packages JohnSnowLabs:spark-nlp:1.6.0
pyspark --packages JohnSnowLabs:spark-nlp:1.6.1
```

```sh
spark-submit --packages JohnSnowLabs:spark-nlp:1.6.0
spark-submit --packages JohnSnowLabs:spark-nlp:1.6.1
```

## Jupyter Notebook
@@ -35,23 +35,23 @@ export SPARK_HOME=/path/to/your/spark/folder
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark --packages JohnSnowLabs:spark-nlp:1.6.0
pyspark --packages JohnSnowLabs:spark-nlp:1.6.1
```

## Apache Zeppelin
This way will work for both Scala and Python
```
export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:1.6.0"
export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:1.6.1"
```
Alternatively, add the following Maven Coordinates to the interpreter's library list
```
com.johnsnowlabs.nlp:spark-nlp_2.11:1.6.0
com.johnsnowlabs.nlp:spark-nlp_2.11:1.6.1
```

## Python without explicit Spark installation
If you installed pyspark through pip, you can now install sparknlp through pip
```
pip install --index-url https://test.pypi.org/simple/ spark-nlp==1.6.0
pip install --index-url https://test.pypi.org/simple/ spark-nlp==1.6.1
```
Then you'll have to create a SparkSession manually, for example:
```
@@ -67,11 +67,11 @@ spark = SparkSession.builder \

## Pre-compiled Spark-NLP and Spark-NLP-OCR
You may download fat-jar from here:
[Spark-NLP 1.6.0 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/spark-nlp-assembly-1.6.0.jar)
[Spark-NLP 1.6.1 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/spark-nlp-assembly-1.6.1.jar)
or non-fat from here
[Spark-NLP 1.6.0 PKG JAR](http://repo1.maven.org/maven2/com/johnsnowlabs/nlp/spark-nlp_2.11/1.6.0/spark-nlp_2.11-1.6.0.jar)
[Spark-NLP 1.6.1 PKG JAR](http://repo1.maven.org/maven2/com/johnsnowlabs/nlp/spark-nlp_2.11/1.6.1/spark-nlp_2.11-1.6.1.jar)
Spark-NLP-OCR Module (Requires native Tesseract 4.x+ for image based OCR. Does not require Spark-NLP to work but highly suggested)
[Spark-NLP-OCR 1.6.0 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/spark-nlp-ocr-assembly-1.6.0.jar)
[Spark-NLP-OCR 1.6.1 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/spark-nlp-ocr-assembly-1.6.1.jar)

## Maven central

@@ -83,19 +83,19 @@ Our package is deployed to maven central. In order to add this package as a depe
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.11</artifactId>
<version>1.6.0</version>
<version>1.6.1</version>
</dependency>
```

#### SBT
```sbtshell
libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.6.0"
libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.6.1"
```

If you are using `scala 2.11`

```sbtshell
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.6.0"
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.6.1"
```

## Using the jar manually
4 changes: 2 additions & 2 deletions build.sbt
@@ -9,7 +9,7 @@ name := "spark-nlp"

organization := "com.johnsnowlabs.nlp"

version := "1.6.0"
version := "1.6.1"

scalaVersion in ThisBuild := scalaVer

@@ -137,7 +137,7 @@ assemblyMergeStrategy in assembly := {
lazy val ocr = (project in file("ocr"))
.settings(
name := "spark-nlp-ocr",
version := "1.6.0",
version := "1.6.1",
libraryDependencies ++= ocrDependencies ++
analyticsDependencies ++
testDependencies,
4 changes: 2 additions & 2 deletions docs/index.html
@@ -78,8 +78,8 @@ <h2 class="title">High Performance NLP with Apache Spark </h2>
</p>
<a class="btn btn-info btn-cta" style="float: center;margin-top: 10px;" href="mailto:[email protected]?subject=SparkNLP%20Slack%20access" target="_blank"> Questions? Join our Slack</a>
<b/><p/><p/>
<p><span class="label label-warning">2018 Jul 7th - Update!</span> 1.6.0 Released! OCR PDF to Spark-NLP capabilities, new Chunker annotator, fixed AWS compatibility, better performance and much more.
Learn changes <a href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/CHANGELOG">HERE</a> and check out for updated documentation below</p>
<p><span class="label label-warning">2018 Aug 9th - Update!</span> 1.6.1 Released! Fixed S3-based clusters support, new CHUNK type annotation and more!
Learn changes <a href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/CHANGELOG">HERE</a> and check out for updated documentation below</p>
</div>
<div id="cards-wrapper" class="cards-wrapper row">
<div class="item item-green col-md-4 col-sm-6 col-xs-6">
18 changes: 9 additions & 9 deletions docs/notebooks.html
@@ -103,7 +103,7 @@ <h4 id="scala-vivekn-notebook" class="section-block"> Vivekn Sentiment Analysis<
Since we are dealing with small amounts of data, we put in practice LightPipelines.
</p>
<p>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/example/src/TrainViveknSentiment.scala" target="_blank"> Take me to notebook!</a>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/example/src/TrainViveknSentiment.scala" target="_blank"> Take me to notebook!</a>
</p>
</div>
</section>
@@ -135,7 +135,7 @@ <h4 id="vivekn-notebook" class="section-block"> Vivekn Sentiment Analysis</h4>
better Sentiment Analysis accuracy
</p>
<p>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/vivekn-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/python/example/vivekn-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
</p>
</div>
<div>
@@ -157,7 +157,7 @@ <h4 id="sentiment-notebook" class="section-block"> Rule-based Sentiment Analysis
Each of these sentences will be used for giving a score to text
</p>
</p>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/dictionary-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/python/example/dictionary-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
</p>
</div>
<div>
@@ -177,7 +177,7 @@ <h4 id="crfner-notebook" class="section-block"> CRF Named Entity Recognition</h4
approach to use the same pipeline for tagging external resources.
</p>
<p>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/crf-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/python/example/crf-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
</p>
</div>
<div>
@@ -196,7 +196,7 @@ <h4 id="dlner-notebook" class="section-block"> CNN Deep Learning NER</h4>
and it will leverage batch-based distributed calls to native TensorFlow libraries during prediction.
</p>
<p>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/dl-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/python/example/dl-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
</p>
</div>
<div>
@@ -211,7 +211,7 @@ <h4 id="text-notebook" class="section-block"> Simple Text Matching</h4>
This annotator is an AnnotatorModel and does not require training.
</p>
<p>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/text-matcher/extractor.ipynb" target="_blank"> Take me to notebook!</a>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/python/example/text-matcher/extractor.ipynb" target="_blank"> Take me to notebook!</a>
</p>
</div>
<div>
@@ -226,7 +226,7 @@ <h4 id="assertion-notebook" class="section-block"> Assertion Status with LogReg<
dataset will return the appropriate result.
</p>
<p>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/logreg-assertion/assertion.ipynb" target="_blank"> Take me to notebook!</a>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/python/example/logreg-assertion/assertion.ipynb" target="_blank"> Take me to notebook!</a>
</p>
</div>
<div>
@@ -241,7 +241,7 @@ <h4 id="dlassertion-notebook" class="section-block"> Deep Learning Assertion Sta
graphs may be redesigned if needed.
</p>
<p>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/dl-assertion/assertion.ipynb" target="_blank"> Take me to notebook!</a>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/python/example/dl-assertion/assertion.ipynb" target="_blank"> Take me to notebook!</a>
</p>
</div>
<div>
@@ -260,7 +260,7 @@ <h4 id="downloader-notebook" class="section-block"> Retrieving Pretrained models
Such components may then be injected seamlessly into further pipelines, and so on.
</p>
<p>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/model-downloader/ModelDownloaderExample.ipynb" target="_blank"> Take me to notebook!</a>
<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.1/python/example/model-downloader/ModelDownloaderExample.ipynb" target="_blank"> Take me to notebook!</a>
</p>
</div>
</section>