Releases: JohnSnowLabs/spark-nlp
Spark NLP 4.0.2: Over 620 new state-of-the-art models in 21 languages, full support for Apache Spark 3.3.0, new Databricks runtime 11.1, and bug fixes
Overview
We are pleased to release Spark NLP 4.0.2! This release comes with full compatibility with the newly released Apache Spark 3.3.0 and official support for the new Databricks 11.1 Beta runtimes (which include Apache Spark 3.3.0 and Scala 2.12).
As always, we would like to thank our community for their feedback, questions, and feature requests.
New Features
- Welcoming new Databricks runtimes based on Spark/PySpark 3.3.0 to our Spark NLP family:
- Databricks 11.1 Beta
- Databricks 11.1 ML Beta
- Databricks 11.1 ML Beta GPU
- SentenceDetector now comes with a new parameter, customBoundsStrategy, for returning the matched custom bounds with each sentence #10567
Example
Given the following text with setCustomBounds([r"\.", ";"]):
This is a sentence. This one uses custom bounds; As is this one;
Without the new parameter, the result is
["This is a sentence", "This one uses custom bounds", "As is this one"]
With the new parameter:
.setCustomBounds([r"\.", ";"])
.setCustomBoundsStrategy("append")
the result will be
["This is a sentence.", "This one uses custom bounds;", "As is this one;"]
Similarly with prepend:
1. This is a list
1.1 This is a subpoint
2. Second thing
2.2 Second subthing
.setCustomBounds([r"\n[\d\.]+"])
.setCustomBoundsStrategy("prepend")
the result will be
[
"1. This is a list",
"1.1 This is a subpoint",
"2. Second thing",
"2.2 Second subthing"
]
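The two strategies above can be mimicked outside Spark with a few lines of plain Python. This is only a sketch of the observable behaviour in the examples, not Spark NLP's implementation; `split_with_bounds` is a hypothetical helper, and the `"none"` value for the default behaviour is an assumption made for illustration.

```python
import re

def split_with_bounds(text, bounds, strategy="none"):
    """Split text on regex bounds; optionally keep each matched bound.

    strategy="append"  -> the bound stays at the end of the sentence before it
    strategy="prepend" -> the bound moves to the start of the sentence after it
    strategy="none"    -> the bound is dropped (assumed default)
    """
    pattern = re.compile("|".join(f"(?:{b})" for b in bounds))
    sentences, start, carry = [], 0, ""
    for match in pattern.finditer(text):
        chunk = text[start:match.start()]
        if strategy == "append":
            piece = chunk + match.group()
        elif strategy == "prepend":
            piece = carry + chunk
            carry = match.group()  # held back for the next sentence
        else:
            piece = chunk
        if piece.strip():
            sentences.append(piece.strip())
        start = match.end()
    tail = (carry if strategy == "prepend" else "") + text[start:]
    if tail.strip():
        sentences.append(tail.strip())
    return sentences

text = "This is a sentence. This one uses custom bounds; As is this one;"
appended = split_with_bounds(text, [r"\.", ";"], "append")

listing = "1. This is a list\n1.1 This is a subpoint\n2. Second thing\n2.2 Second subthing"
prepended = split_with_bounds(listing, [r"\n[\d\.]+"], "prepend")
```

Here `appended` and `prepended` reproduce the two result lists shown in the examples above.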
Bug Fixes
- Fix bug that attempts to create spark session on executors when using GraphExtraction in Spark/PySpark 3.3 #9905
Models and Pipelines
Spark NLP 4.0.2 comes with 620+ state-of-the-art pre-trained transformer models in 21 languages including multi-lingual models.
Featured Models
The complete list of all 6900+ models & pipelines in 230+ languages is available on Models Hub
Documentation & Articles
- Spark NLP: Hardware Acceleration
- Serving Spark NLP via API in Java
- TF Hub & HuggingFace to Spark NLP
- Models Hub with new models
- Spark NLP documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
- Spark NLP Workshop notebooks
- Spark NLP publications
- Spark NLP in Action
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP Display for visualization of different types of annotations
- Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP!
Installation
Python
#PyPI
pip install spark-nlp==4.0.2
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.2
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.2
M1
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.2
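The three package flavours above differ only in the artifact suffix, so the coordinate can also be built programmatically and passed to Spark through its standard `spark.jars.packages` setting. The sketch below assumes that approach; `spark_nlp_coordinate` and `session_conf` are hypothetical helpers, not part of Spark NLP.

```python
SCALA_VERSION = "2.12"  # the only Scala version listed for 4.x

def spark_nlp_coordinate(version, flavor="cpu"):
    """Return the Maven coordinate for a Spark NLP 4.x package flavour."""
    suffix = {"cpu": "", "gpu": "-gpu", "m1": "-m1"}[flavor]
    return f"com.johnsnowlabs.nlp:spark-nlp{suffix}_{SCALA_VERSION}:{version}"

def session_conf(version, flavor="cpu"):
    """Spark configuration entries equivalent to the --packages flag."""
    return {"spark.jars.packages": spark_nlp_coordinate(version, flavor)}

# With PySpark installed, the entries could be applied like this:
#   from pyspark.sql import SparkSession
#   builder = SparkSession.builder.appName("spark-nlp")
#   for key, value in session_conf("4.0.2", "gpu").items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
```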
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x (Scala 2.12):
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>4.0.2</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>4.0.2</version>
</dependency>
spark-nlp-m1:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-m1_2.12</artifactId>
<version>4.0.2</version>
</dependency>
FAT JARs
- CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.0.2.jar
- GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.0.2.jar
- M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.0.2.jar
What's Changed
Contributors
@gadde5300 @danilojsl @hsaglamlar @Cabir40 @ahmedlone127 @muhammetsnts @KshitizGIT @maziyarpanahi @albertoandreottiATgmail @DevinTDHa @luca-martial @Damla-Gurbaz @jsl-models @Meryem1425
New Contributors
- @hsaglamlar made their first contribution in #10544
Full Changelog: 4.0.1...4.0.2
Spark NLP 4.0.1: Full support for Apache Spark 3.3.0, new Databricks runtime 11, enhancements, and other bug fixes!
Overview
We are pleased to release Spark NLP 4.0.1! This release adds support for the newly released Apache Spark 3.3.0, which brings improved join query performance via Bloom filters, increased Pandas API coverage, and many other improvements. In addition, Spark NLP comes with official support for Databricks runtimes 11, other enhancements, and bug fixes.
As always, we would like to thank our community for their feedback, questions, and feature requests.
Features & Enhancements
- Full support for Apache Spark & PySpark 3.3.0
- Add Apache Spark 3.3.0 to Google Colab and Kaggle setup scripts
- New -g option for Google Colab and Kaggle setup on GPU device to upgrade libcudnn8 to 8.1.0 to solve the issue on GPU
- Welcoming new Databricks runtimes based on Spark/PySpark 3.3.0 to our Spark NLP family:
- Databricks 11.0 LTS
- Databricks 11.0 LTS ML
- Databricks 11.0 LTS ML GPU
Bug Fixes
- Fix the error caused by PySpark 3.3.0 in CoNLL, CoNLLU, POS, and PubTator annotators as training helpers
- Fix and re-upload Dependency and Type Dependency parser pre-trained models
- Update pre-trained pipelines with issues on PySpark 3.2 and 3.3
Documentation
- Serving Spark NLP via API in Java
- TF Hub & HuggingFace to Spark NLP
- Models Hub with new models
- Spark NLP documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
- Spark NLP Workshop notebooks
- Spark NLP publications
- Spark NLP in Action
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP Display for visualization of different types of annotations
- Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP!
Installation
Python
#PyPI
pip install spark-nlp==4.0.1
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x (Scala 2.12):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.1
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.1
M1
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.1
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x (Scala 2.12):
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>4.0.1</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>4.0.1</version>
</dependency>
spark-nlp-m1:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-m1_2.12</artifactId>
<version>4.0.1</version>
</dependency>
FAT JARs
- CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.0.1.jar
- GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.0.1.jar
- M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.0.1.jar
What's Changed
Contributors
@muhammetsnts @jsl-models @Meryem1425 @Damla-Gurbaz @jsl-builder @rpranab @danilojsl @josejuanmartinez @Cabir40 @DevinTDHa @agsfer @suvrat-joshi @ahmedlone127 @albertoandreottiATgmail @KshitizGIT @mahmoodbayeshi @maziyarpanahi
New Contributors
- @ahmedlone127 made their first contribution in #9887
Full Changelog: 4.0.0...4.0.1
Spark NLP 4.0.0: New modern extractive Question answering (QA) annotators for ALBERT, BERT, DistilBERT, DeBERTa, RoBERTa, Longformer, and XLM-RoBERTa, official support for Apple silicon M1, support oneDNN to improve CPU up to 97%, improved transformers on GPU up to +700%, 1000+ state-of-the-art models, and lots more!
Overview
We are very excited to release Spark NLP 4.0.0! This has been one of the biggest releases we have ever done and we are so proud to share this with our community!
This release comes with official support for the Apple silicon M1 chip (for the first time), official support for Spark/PySpark 3.2, support for the oneAPI Deep Neural Network Library (oneDNN) to improve TensorFlow performance on CPU by up to 97%, optimized transformer-based embeddings on GPU that increase performance by up to +700%, brand new modern extractive transformer-based question answering (QA) annotators for tasks like SQuAD based on ALBERT, BERT, DistilBERT, DeBERTa, RoBERTa, Longformer, and XLM-RoBERTa architectures, 1000+ state-of-the-art models, a WordEmbeddingsModel that now works in clusters without HDFS/DBFS/S3 such as Kubernetes, new Databricks and EMR support, new NER models achieving the highest F1 score in Spark NLP, and many more enhancements and bug fixes!
We would like to mention that Spark NLP 4.0.0 drops the support for Spark 2.3 and 2.4 (Scala 2.11). Starting with 4.0.0, we only support Spark/PySpark 3.x on Scala 2.12.
As always, we would like to thank our community for their feedback, questions, and feature requests.
Major features and improvements
- NEW: Support for the oneAPI Deep Neural Network Library (oneDNN) optimizations to improve TensorFlow on CPU. Enabling oneDNN can improve the performance of some transformer-based models by up to 97%. By default, the oneDNN optimizations are turned off. To enable them, set the environment variable TF_ENABLE_ONEDNN_OPTS. On Linux systems, for instance:
export TF_ENABLE_ONEDNN_OPTS=1
- NEW: Optimizing batch processing for transformer-based Word Embeddings on a GPU device. These optimizations can result in performance improvements up to +700% (more details in the Benchmarks section)
- NEW: Official support for Apple silicon M1 on macOS devices. You can use the spark-nlp-m1 package that supports Apple silicon M1 on your macOS machine in Spark NLP 4.0.0
- NEW: Introducing AlbertForQuestionAnswering annotator in Spark NLP. AlbertForQuestionAnswering can load ALBERT models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using AlbertForQuestionAnswering for PyTorch or TFAlbertForQuestionAnswering for TensorFlow models in HuggingFace
- NEW: Introducing BertForQuestionAnswering annotator in Spark NLP. BertForQuestionAnswering can load BERT & ELECTRA models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using BertForQuestionAnswering and ElectraForQuestionAnswering for PyTorch or TFBertForQuestionAnswering and TFElectraForQuestionAnswering for TensorFlow models in HuggingFace
- NEW: Introducing DeBertaForQuestionAnswering annotator in Spark NLP. DeBertaForQuestionAnswering can load DeBERTa v2&v3 models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using DebertaV2ForQuestionAnswering for PyTorch or TFDebertaV2ForQuestionAnswering for TensorFlow models in HuggingFace
- NEW: Introducing DistilBertForQuestionAnswering annotator in Spark NLP. DistilBertForQuestionAnswering can load DistilBERT models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using DistilBertForQuestionAnswering for PyTorch or TFDistilBertForQuestionAnswering for TensorFlow models in HuggingFace
- NEW: Introducing LongformerForQuestionAnswering annotator in Spark NLP. LongformerForQuestionAnswering can load Longformer models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using LongformerForQuestionAnswering for PyTorch or TFLongformerForQuestionAnswering for TensorFlow models in HuggingFace
- NEW: Introducing RoBertaForQuestionAnswering annotator in Spark NLP. RoBertaForQuestionAnswering can load RoBERTa models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using RobertaForQuestionAnswering for PyTorch or TFRobertaForQuestionAnswering for TensorFlow models in HuggingFace
- NEW: Introducing XlmRoBertaForQuestionAnswering annotator in Spark NLP. XlmRoBertaForQuestionAnswering can load XLM-RoBERTa models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using XLMRobertaForQuestionAnswering for PyTorch or TFXLMRobertaForQuestionAnswering for TensorFlow models in HuggingFace
- NEW: Introducing MultiDocumentAssembler annotator where multiple inputs need to be converted to DOCUMENT, such as in XXXForQuestionAnswering annotators
- NEW: Introducing SpanBertCorefModel annotator for Coreference Resolution on BERT and SpanBERT models based on BERT for Coreference Resolution: Baselines and Analysis paper. An implementation of a SpanBert-based coreference resolution model.
- NEW: Introducing enableInMemoryStorage parameter in WordEmbeddingsModel annotator. By enabling this parameter the annotator will no longer require a distributed storage to unpack indices and will perform everything in-memory.
- Official support for Apache Spark and PySpark 3.2.x on Scala 2.12. Spark NLP by default is shipped for Spark 3.2.x and supports Spark/PySpark 3.0.x and 3.1.x in addition
- Unifying all supported Apache Spark packages on Maven into spark-nlp for CPU, spark-nlp-gpu for GPU, and spark-nlp-m1 for new Apple silicon M1 on macOS. The need for Apache Spark specific packages like spark-nlp-spark32 has been removed.
- Adding a new param to sparknlp.start() function in Python and Scala for Apple silicon M1 on macOS (m1=True)
- Upgrade TensorFlow to 2.7.1 and start supporting Apple silicon M1
- Upgrade RocksDB with new enhancements and support for Apple silicon M1
- Upgrade SentencePiece tokenizer TF ops to 2.7.1
- Upgrade SentencePiece JNI to v0.1.96 and provide support for Apple silicon M1 on macOS
- Upgrade to Scala 2.12.15
- Update Colab, Kaggle, and SageMaker scripts
- Refactor the entire Python module in Spark NLP to make the development and maintenance easier
- Refactor unit tests in Python and migrate to pytest
- Welcoming 6x new Databricks runtimes to our Spark NLP family:
- Databricks 10.4 LTS
- Databricks 10.4 LTS ML
- Databricks 10.4 LTS ML GPU
- Databricks 10.5
- Databricks 10.5 ML
- Databricks 10.5 ML GPU
- Welcoming a new EMR 6.x series to our Spark NLP family:
- EMR 6.6.0 (Apache Spark 3.2.0 / Hadoop 3.2.1)
- Migrate T5Transformer to TensorFlow v2 architecture by re-uploading all the existing models
- Support for 2 inputs in LightPipeline with MultiDocumentAssembler
- Add new default NerDL graph for xsmall DeBERTa embeddings model (384 dimensions)
- Adding annotateJava method to PretrainedPipeline class in Java to facilitate the use of LightPipelines
- Allow change of case sensitivity. Previously, the user could not set the setCaseSensitive param. This allows users to change this value if the model was saved/uploaded with the wrong case sensitivity parameter (BERT, ALBERT, DistilBERT, RoBERTa, DeBERTa, XLM-RoBERTa, and Longformer for XXXForSequenceClassification and XXXForTokenClassification).
- Keep accuracy in ClassifierDL and SentimentDL during the training between 0.0 and 1.0
- Preserve the original form of the token in BPE Tokenizer used in RoBERTa annotators (used in embeddings, sequence and token classification)
Performance Improvements (Benchmarks)
We have introduced two major performance improvements for GPU and CPU devices in Spark NLP 4.0.0 release.
The following benchmarks have been done by using a single Dell Server with the following specs:
- GPU: Tesla P100 PCIe 12GB - CUDA Version: 11.3 - Driver Version: 465.19.01
- CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz - 40 Cores
- Memory: 80G
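The CPU-side improvement relies on the oneDNN optimizations, which are off by default and controlled by the TF_ENABLE_ONEDNN_OPTS environment variable. A minimal Python sketch of setting it before a session starts is shown below. It assumes that the driver JVM inherits the Python process environment (as in local or client mode) and that executors receive the variable through Spark's `spark.executorEnv.*` settings; `onednn_settings` is a hypothetical helper, and whether a given cluster honours these settings depends on the deployment.

```python
import os

ONEDNN_FLAG = "TF_ENABLE_ONEDNN_OPTS"

def onednn_settings(enabled=True):
    """Set the flag for this process and return matching Spark conf entries."""
    value = "1" if enabled else "0"
    # Driver side: must happen before the Spark session (and its JVM) is started.
    os.environ[ONEDNN_FLAG] = value
    # Executor side: standard Spark mechanism for passing environment variables.
    return {f"spark.executorEnv.{ONEDNN_FLAG}": value}

conf = onednn_settings(True)
# The returned entries could then be passed to SparkSession.builder.config(key, value)
# for each key/value pair before calling getOrCreate().
```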
GPU
We have improved our batch processing approach for transformer-based Word Embeddings to improve their performance on a GPU device. These optimizations result in performance improvements up to +700%. The detailed list of improved transformer models on GPU in comparison to Spark NLP 3.4.x:
Model on GPU | Spark NLP 3.4.3 vs. 4.0.0 |
---|---|
RoBERTa base | +560%(6.6x) |
RoBERTa Large | +332%(4.3x) |
Albert Base | +587%(6.9x... |
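The two figures in each row express the same gain: an improvement of +N% corresponds to a speedup factor of 1 + N/100. A short check of the rows shown above (the helper name is ours):

```python
def speedup_factor(percent_gain):
    """Convert a '+N%' improvement into an 'x times faster' factor."""
    return 1 + percent_gain / 100

rows = [("RoBERTa base", 560), ("RoBERTa Large", 332), ("Albert Base", 587)]
formatted = [f"{model}: +{gain}% = {speedup_factor(gain):.1f}x" for model, gain in rows]
# ['RoBERTa base: +560% = 6.6x', 'RoBERTa Large: +332% = 4.3x', 'Albert Base: +587% = 6.9x']
```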
Spark NLP 3.4.4: New DeBERTa for Token Classification, new CamemBERT embeddings, speed improvements for Tokenizer and UniversalSentenceEncoder annotators, over 160 new state-of-the-art models, and other improvements!
Overview
We are very excited to release Spark NLP 3.4.4! This release comes with a new DeBERTa for Token Classification annotator compatible with existing or fine-tuned models on HuggingFace, a new annotator for CamemBERT embeddings models, up to 18x faster UniversalSentenceEncoder on GPU devices, up to 400% speed improvements in Tokenizer with a list of exceptions, and new state-of-the-art NER, French embeddings, DistilBERT embeddings, and ALBERT embeddings models!
As always, we would like to thank our community for their feedback, questions, and feature requests.
New Features
- NEW: Introducing DeBertaForTokenClassification annotator in Spark NLP. DeBertaForTokenClassification can load DeBERTa v2&v3 models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using DebertaV2ForTokenClassification for PyTorch or TFDebertaV2ForTokenClassification for TensorFlow models in HuggingFace #8082
- NEW: Introducing CamemBertEmbeddings annotator in Spark NLP. #8237 CamemBERT is a state-of-the-art language model for French based on the RoBERTa architecture pretrained on the French subcorpus of the newly available multilingual corpus OSCAR. For further information or requests, please go to Camembert Website
- Add support for batching rows to improve UniversalSentenceEncoder on GPU devices. This new feature will increase GPU speed by 2x to 18x depending on the distribution of sentences #8234
Bug Fixes & Enhancements
- Optimizing Tokenizer performance up to 400% when there is an exceptions list. We have improved the exceptions list to be scalable to a large number of exceptions without impacting the overall performance #7881
- Support latest PySpark releases in Colab, Kaggle, and SageMaker scripts #8028
- Fix bug that caused get input/output/LazyAnnotator to return None #8043
- Fix DeBertaForSequenceClassification in Python failing to load pretrained models #8060
- Fix missing Lemma and POS models from 3.4.3 release
Dependencies
- Removing outdated trove4j dependency in favour of native Java modules #8236
- Upgrade the base Apache Spark to 2.4.8, 3.0.3, and 3.2.1
- Upgrade Typesafe Config to 1.4.2
- Upgrade sbt to 1.6.2
Models
Spark NLP 3.4.4 comes with over 160 state-of-the-art multi-lingual pretrained models. Some of the featured models:
New DeBERTa Token Classification Models
New fine-tuned DeBERTa v3 models for token classifications over CoNLL03 and OntoNotes datasets that reach state-of-the-art metrics.
Model | Name | Lang | F1 Dev |
---|---|---|---|
DeBertaForTokenClassification | deberta_v3_large_token_classifier_conll03 | en | 0.97 |
DeBertaForTokenClassification | deberta_v3_base_token_classifier_conll03 | en | 0.96 |
DeBertaForTokenClassification | deberta_v3_small_token_classifier_conll03 | en | 0.95 |
DeBertaForTokenClassification | deberta_v3_xsmall_token_classifier_conll03 | en | 0.93 |
DeBertaForTokenClassification | deberta_v3_large_token_classifier_ontonotes | en | 0.89 |
DeBertaForTokenClassification | deberta_v3_base_token_classifier_ontonotes | en | 0.88 |
DeBertaForTokenClassification | deberta_v3_small_token_classifier_ontonotes | en | 0.87 |
DeBertaForTokenClassification | deberta_v3_xsmall_token_classifier_ontonotes | en | 0.86 |
New CamemBERT Models
Model | Name | Lang |
---|---|---|
CamemBertEmbeddings | camembert_large | fr |
CamemBertEmbeddings | camembert_base | fr |
CamemBertEmbeddings | camembert_base_ccnet_4gb | fr |
CamemBertEmbeddings | camembert_base_ccnet | fr |
CamemBertEmbeddings | camembert_base_oscar_4gb | fr |
CamemBertEmbeddings | camembert_base_wikipedia_4gb | fr |
New DistilBERT Embeddings Models
Model | Name | Lang |
---|---|---|
DistilBertEmbeddings | distilbert_embeddings_distilbert_base_fr_cased | fr |
DistilBertEmbeddings | distilbert_embeddings_marathi_distilbert | mr |
DistilBertEmbeddings | distilbert_embeddings_distilbert_base_indonesian | id |
DistilBertEmbeddings | distilbert_embeddings_javanese_distilbert_small | jv |
DistilBertEmbeddings | distilbert_embeddings_malaysian_distilbert_small | ms |
DistilBertEmbeddings | distilbert_embeddings_distilbert_base_ar_cased | ar |
New ALBERT Embeddings Models
Model | Name | Lang |
---|---|---|
AlbertEmbeddings | albert_embeddings_fralbert_base | fr |
AlbertEmbeddings | albert_embeddings_albert_base_arabic | ar |
AlbertEmbeddings | albert_embeddings_marathi_albert_v2 | mr |
AlbertEmbeddings | albert_embeddings_albert_fa_base_v2 | fa |
AlbertEmbeddings | albert_embeddings_albert_large_bahasa_cased | ms |
AlbertEmbeddings | albert_embeddings_marathi_albert | mr |
The complete list of all 5000+ models & pipelines in 200+ languages is available on Models Hub.
New Notebooks
Import CamemBERT models to Spark NLP
Spark NLP | HuggingFace Notebooks | Colab |
---|---|---|
CamemBertEmbeddings | HuggingFace in Spark NLP - CamemBERT |
You can visit Import Transformers in Spark NLP for more info
Documentation
- TF Hub & HuggingFace to Spark NLP
- Models Hub with new models
- Spark NLP documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
- Spark NLP Workshop notebooks
- Spark NLP publications
- Spark NLP in Action
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP Display for visualization of different types of annotations
- [Discussions](https://github.com/John...
John Snow Labs Spark-NLP 3.4.3: New DeBERTa for Sequence Classification, sigmoid activation for sequence classifiers, new features for SentenceDetectorDL, over 600 new multi-lingual models, and other improvements!
Overview
We are very excited to release Spark NLP 3.4.3! This release comes with a new DeBERTa for Sequence Classification annotator compatible with existing or fine-tuned models on HuggingFace, a new sigmoid activation function in addition to softmax to support multi-label models in all ForSequenceClassification annotators, new features added to SentenceDetectorDL, new features added to CoNLLU and Lemmatizer, and more than 600 new multi-lingual models for DeBERTa, BERT, DistilBERT, fastText, Lemmatizer and Part of Speech, and other improvements!
As always, we would like to thank our community for their feedback, questions, and feature requests.
New Features
- NEW: Introducing DeBertaForSequenceClassification annotator in Spark NLP. DeBertaForSequenceClassification can load DeBERTa v2&v3 models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using DebertaForSequenceClassification for PyTorch or TFDebertaForSequenceClassification for TensorFlow models in HuggingFace #7713
- New multi-label feature in all ForSequenceClassification annotators. The following annotators now have the option to switch to sigmoid activation function instead of softmax for the output layer: AlbertForSequenceClassification, BertForSequenceClassification, DeBertaForSequenceClassification, DistilBertForSequenceClassification, LongformerForSequenceClassification, RoBertaForSequenceClassification, XlmRoBertaForSequenceClassification, and XlnetForSequenceClassification #7479
- New minLength, maxLength, splitLength, customBounds, and useCustomBoundsOnly parameters in SentenceDetectorDL #7214
- New impossiblePenultimates in SentenceDetectorDLModel #7685
- New feature to set names for columns in CoNLLU class: textCol, documentCol, sentenceCol, formCol, uposCol, xposCol, and lemmaCol #7344
- New formCol and lemmaCol parameters in Lemmatizer annotator #7344
- Add new functionality to download and extract models from S3 via direct link #7682
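To see why a sigmoid output layer enables multi-label predictions, compare the two activations on the same logits: softmax makes the classes compete for a total probability of 1, while sigmoid scores each class independently. The sketch below is plain Python for illustration only; the labels, the logits, and the 0.5 cut-off are assumptions, not values taken from Spark NLP.

```python
import math

def softmax(logits):
    """Probabilities that sum to 1: suited to exactly one label per input."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # shift by the max for stability
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Independent per-class probabilities: suited to multi-label outputs."""
    return [1 / (1 + math.exp(-x)) for x in logits]

labels = ["sports", "politics", "weather"]  # hypothetical labels
logits = [2.0, 1.5, -1.0]                   # hypothetical model outputs

probs = softmax(logits)
single_label = labels[probs.index(max(probs))]

threshold = 0.5  # assumed cut-off for illustration
multi_labels = [lab for lab, p in zip(labels, sigmoid(logits)) if p >= threshold]
```

With these values, softmax selects only "sports", while the sigmoid path keeps both "sports" and "politics" because each clears the threshold on its own.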
Enhancements
- Fix and train new English spell checker models for Spark NLP 3.4.1 on Spark 3.x and 2.x
- Update SentenceDetector Python and Scala documentation
- Add a missing notebook to demonstrate training a WordSegmenterApproach annotator for word segmentation https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/chinese/word-segmentation/WordSegmenter_train_chinese_segmentation.ipynb
Models
New DeBERTa Classification Models
New fine-tuned DeBERTa v3 models for text classifications over IMDB reviews in English and Urdu, AG News categories in English, and Allocine French reviews.
Model | Name | Lang |
---|---|---|
DeBertaForSequenceClassification | mdeberta_v3_base_sequence_classifier_imdb | ur |
DeBertaForSequenceClassification | mdeberta_v3_base_sequence_classifier_allocine | fr |
DeBertaForSequenceClassification | deberta_v3_xsmall_sequence_classifier_imdb | en |
DeBertaForSequenceClassification | deberta_v3_small_sequence_classifier_imdb | en |
DeBertaForSequenceClassification | deberta_v3_base_sequence_classifier_imdb | en |
DeBertaForSequenceClassification | deberta_v3_large_sequence_classifier_imdb | en |
DeBertaForSequenceClassification | deberta_v3_xsmall_sequence_classifier_ag_news | en |
DeBertaForSequenceClassification | deberta_v3_small_sequence_classifier_ag_news | en |
New BERT Models
Spark NLP now has up to 250 state-of-the-art BERT models in 27 languages including Arabic, Bengali, Chinese, Dutch, English, Finnish, French, German, Greek, Hindi, Italian, Japanese, Javanese, Korean, Marathi, Panjabi, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Telugu, Turkish, Urdu, Vietnamese, and Multi-lingual.
Model | Name | Lang |
---|---|---|
BertEmbeddings | bert_embeddings_ARBERT | ar |
BertEmbeddings | bert_embeddings_German_MedBERT | de |
BertEmbeddings | bert_embeddings_bangla_bert_base | bn |
BertEmbeddings | bert_embeddings_bert_base_5lang_cased | zh |
BertEmbeddings | bert_embeddings_bert_base_5lang_cased | fr |
BertEmbeddings | bert_embeddings_bert_base_hi_cased | hi |
BertEmbeddings | bert_embeddings_bert_base_it_cased | it |
BertEmbeddings | bert_embeddings_bert_base | ko |
BertEmbeddings | bert_embeddings_bert_base_tr_cased | tr |
BertEmbeddings | bert_embeddings_bert_base_ur_cased | ur |
BertEmbeddings | bert_embeddings_bert_base_vi_cased | vi |
New fastText Models
Over 128 new Word2Vec models in 128 languages made by fastText word embeddings.
Model | Name | Lang |
---|---|---|
WordEmbeddingsModel | w2v_cc_300d | hi |
WordEmbeddingsModel | w2v_cc_300d | azb |
WordEmbeddingsModel | w2v_cc_300d | bo |
WordEmbeddingsModel | w2v_cc_300d | diq |
WordEmbeddingsModel | w2v_cc_300d | cy |
WordEmbeddingsModel | w2v_cc_300d | ckb |
WordEmbeddingsModel | w2v_cc_300d | el |
WordEmbeddingsModel | w2v_cc_300d | es |
New Lemmatizer and Part of Speech Models
234 new Lemmatizer and Part of Speech models in 62 languages based on the new Universal Dependency treebank 2.9 release.
Model | Name | Lang |
---|---|---|
LemmatizerModel | lemma_afribooms | af |
LemmatizerModel | lemma_alksnis | lt |
LemmatizerModel | lemma_alpino | nl |
LemmatizerModel | lemma_arcosg | gd |
LemmatizerModel | lemma_ancora | es |
LemmatizerModel | lemma_ancora | ca |
PerceptronModel | pos_mtg | te |
PerceptronModel | pos_ttb | ta |
PerceptronModel | pos_vtb | vi |
PerceptronModel | pos_cac | cs |
PerceptronModel | pos_btb | bg |
PerceptronModel | pos_afribooms | af |
The complete list of all 4800+ models & pipelines in 200+ languages is available on Models Hub.
Documentation
- [T...
John Snow Labs Spark-NLP 3.4.2: DeBERTa embeddings, new caching in Word2Vec and Doc2Vec, new state-of-the-art models, and bug fixes!
Overview
We are pleased to release Spark NLP 3.4.2! This release comes with a new DeBERTa transformer for word embeddings, new caching to speed up training Word2Vec and Doc2Vec, new English and multi-lingual state-of-the-art models, and bug fixes!
As always, we would like to thank our community for their feedback, questions, and feature requests.
New Features
- Introducing DeBertaEmbeddings annotator. DeBERTa (Decoding-enhanced BERT with disentangled attention) improves the BERT and RoBERTa models using two novel techniques. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). This annotator is compatible with all the models trained/fine-tuned by using DebertaV2Model for PyTorch or TFDebertaV2Model for TensorFlow models (DeBERTa-v2 & DeBERTa-v3) in HuggingFace
- Introducing a new param enableCaching in Doc2VecApproach to speed up the training
- Introducing a new param enableCaching in Word2VecApproach to speed up the training
- Support Databricks runtime 10.3, 10.3 ML, and 10.3 ML & GPU
- Support EMR emr-5.34.0 and emr-6.5.0
Bug Fixes
- Fix bestModelMetric param when the set value was ignored #6978
New Notebooks
Import DeBERTa models to Spark NLP
Spark NLP | HuggingFace Notebooks | Colab |
---|---|---|
DeBertaEmbeddings | HuggingFace in Spark NLP - DeBERTa |
You can visit Import Transformers in Spark NLP for more info
Models
New state-of-the-art DeBERTa models:
Model | Name | Lang |
---|---|---|
DeBertaEmbeddings | deberta_v3_xsmall | en |
DeBertaEmbeddings | deberta_v3_small | en |
DeBertaEmbeddings | deberta_v3_base | en |
DeBertaEmbeddings | deberta_v3_large | en |
DeBertaEmbeddings | mdeberta_v3_base | xx |
Documentation
- TF Hub & HuggingFace to Spark NLP
- Models Hub with new models
- Spark NLP documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
- Spark NLP Workshop notebooks
- Spark NLP publications
- Spark NLP in Action
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP Display for visualization of different types of annotations
- Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP!
Installation
Python
#PyPI
pip install spark-nlp==3.4.2
Spark Packages
spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.2
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.2
spark-nlp on Apache Spark 3.2.x (Scala 2.12 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.2
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.2
spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.2
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.2
spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.2
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.2
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.2
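The package coordinates above follow a regular naming pattern. As a rough sketch of that pattern only, the helper below builds the `--packages` coordinate for a given Apache Spark version. The function name `spark_nlp_coordinate` and the lookup table are hypothetical and are not part of Spark NLP; the artifact names are the ones listed in this section.

```python
# Hypothetical helper (not part of Spark NLP): maps an Apache Spark version
# to the artifact suffix and Scala version listed in this section.
_SUFFIX_BY_SPARK = {
    "3.0": ("", "2.12"),
    "3.1": ("", "2.12"),
    "3.2": ("-spark32", "2.12"),
    "2.4": ("-spark24", "2.11"),
    "2.3": ("-spark23", "2.11"),
}


def spark_nlp_coordinate(spark_version: str, gpu: bool = False,
                         release: str = "3.4.2") -> str:
    """Build a Maven coordinate such as com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.2."""
    major_minor = ".".join(spark_version.split(".")[:2])
    if major_minor not in _SUFFIX_BY_SPARK:
        raise ValueError(f"Unsupported Apache Spark version: {spark_version}")
    suffix, scala = _SUFFIX_BY_SPARK[major_minor]
    artifact = "spark-nlp" + ("-gpu" if gpu else "") + suffix
    return f"com.johnsnowlabs.nlp:{artifact}_{scala}:{release}"


print(spark_nlp_coordinate("3.2.1", gpu=True))
```

The returned string can be passed to `spark-shell --packages` or `pyspark --packages` as shown above.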
Maven
spark-nlp on Apache Spark 3.0.x and 3.1.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>3.4.2</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>3.4.2</version>
</dependency>
spark-nlp on Apache Spark 3.2.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark32_2.12</artifactId>
<version>3.4.2</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark32_2.12</artifactId>
<version>3.4.2</version>
</dependency>
spark-nlp on Apache Spark 2.4.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark24_2.11</artifactId>
<version>3.4.2</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
<version>3.4.2</version>
</dependency>
spark-nlp on Apache Spark 2.3.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark23_2.11</artifactId>
<version>3.4.2</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
<version>3.4.2</version>
</dependency>
FAT JARs
- CPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.4.2.jar
- GPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.4.2.jar
- CPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark32-assembly-3.4.2.jar
- GPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark32-assembly-3.4.2.jar
- CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.4.2.jar
- GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.4.2.jar
- CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.4.2.jar
- GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.4.2.jar
What's Changed
Full Changelog: 3.4.1...3.4.2
New Contributors
- @mahmoodbayeshi made their first contribution in #6835
- @bunyamin-polat made their first contribution in #6969
John Snow Labs Spark-NLP 3.4.1: TF session warmup, a new F1 metric to track to save the best model in NerDL, new T5 models like WikiSQL or grammar corrector, other new multi-lingual state-of-the-art models, and bug fixes!
Overview
We are pleased to release Spark NLP 🚀 3.4.1! This release comes with a TF session warmup for 3 annotators whose first inference was slower than the rest, a new param to choose which F1 metric to track when saving the best model during NerDL training, new T5 models such as text-to-SQL and grammar correction, new multi-lingual state-of-the-art models, and other bug fixes!
As always, we would like to thank our community for their feedback, questions, and feature requests.
New Features & Enhancements
- Implement TF Session warmup for MarianTransformer, T5Transformer, and GPT2Transformer annotators. The first inference for these annotators used to take 15-20 seconds; with the warmup session, every inference, including the first one, now takes about the same time #6773
- Add bestModelMetric param to choose between Micro-average or Macro-average for best model #6749
- Add trimWhitespace and preservePosition params to RegexTokenizer #6806
- Add a new `setSentenceMatch` param to EntityRuler to match entities across documents/sentences and not just tokens #6841
- Add support for using the spark32 and real_time_output flags in the sparknlp.start() function at the same time #6822
- Allow users to set tasks in the T5Transformer annotator
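The warmup idea from the list above can be illustrated without any Spark or TensorFlow dependency. The sketch below is a toy model, not the Spark NLP implementation: the class `LazyModel` and its simulated setup delay are assumptions made up for illustration. It shows why running one throwaway inference at load time moves the one-time initialization cost away from the first real request.

```python
import time


class LazyModel:
    """Toy stand-in for an annotator whose first call pays a one-time setup cost."""

    def __init__(self, setup_seconds: float = 0.05, warmup: bool = False):
        self._setup_seconds = setup_seconds
        self._ready = False
        if warmup:
            # Run one throwaway inference at load time so setup happens here.
            self.predict("warmup")

    def predict(self, text: str) -> str:
        if not self._ready:
            time.sleep(self._setup_seconds)  # simulated session initialization
            self._ready = True
        return text.upper()


def timed_first_call(model: LazyModel) -> float:
    """Measure how long the first user-facing prediction takes."""
    start = time.perf_counter()
    model.predict("hello")
    return time.perf_counter() - start


cold = timed_first_call(LazyModel(warmup=False))
warm = timed_first_call(LazyModel(warmup=True))
print(f"first call without warmup: {cold:.3f}s, with warmup: {warm:.3f}s")
```

The trade-off is that loading the model takes longer, which is usually acceptable because it happens once.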
Bug Fixes
- Fix random NullPointerException when using TensorFlow models without Kryo serialization #6741
- Fix RecursiveTokenizerModel not being readable in a saved Pipeline #6748
- Fix ContextSpellCheckerApproach not being trained on Databricks #6750
- Fix the wrong order of tokens in ContextSpellCheckerModel when it's used with Sentence Detectors #6799
- Fix GraphExtraction when fullAnnotate and document are used at the same time #6845
- Fix Word2VecModel being cast to Doc2VecModel by mistake #6849
- Fix broken sentence indexing in BertEmbeddings that impacted SentenceEmbeddings for text classification #6867
- Fix missing setExceptionsPath param in Tokenizer when it's used in Python #6868
- Fix the wrong metrics being mentioned when useBestModel was enabled. The documentation said Micro-averaged F1 but in fact, it was Macro-average F1 (the option to choose which metric to be tracked is now available as well)
- Update broken slow unit tests #6767
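Since the choice between Micro-average and Macro-average F1 matters for which model is kept as the best one, here is a small library-free sketch of the difference. The per-label counts are made-up illustrative numbers, and the function names are hypothetical; this is not the code Spark NLP uses internally.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) per label
Counts = Tuple[int, int, int]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 written directly in terms of counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(per_label: Dict[str, Counts]) -> float:
    """Pool the counts of all labels first, then compute a single F1."""
    tp = sum(c[0] for c in per_label.values())
    fp = sum(c[1] for c in per_label.values())
    fn = sum(c[2] for c in per_label.values())
    return f1(tp, fp, fn)


def macro_f1(per_label: Dict[str, Counts]) -> float:
    """Compute F1 per label, then take the unweighted mean."""
    return sum(f1(*c) for c in per_label.values()) / len(per_label)


# Illustrative counts: a frequent label that is easy and a rare label that is hard.
counts = {"PER": (90, 10, 10), "ORG": (5, 5, 15)}
print(f"micro F1 = {micro_f1(counts):.4f}")
print(f"macro F1 = {macro_f1(counts):.4f}")
```

With imbalanced labels like these, micro F1 is dominated by the frequent label, while macro F1 weights every label equally, so the two metrics can pick different epochs as the best model.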
Models
New state-of-the-art models in English, French, Vietnamese, Dutch, and Indic languages (Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu)
Featured Pretrained Models
Model | Name | Lang |
---|---|---|
T5Transformer | t5_informal_to_formal_styletransfer | en |
T5Transformer | t5_formal_to_informal_styletransfer | en |
T5Transformer | t5_passive_to_active_styletransfer | en |
T5Transformer | t5_active_to_passive_styletransfer | en |
T5Transformer | t5_grammar_error_corrector | en |
T5Transformer | t5_small_wikiSQL | en |
LongformerEmbeddings | clinical_longformer | en |
AlbertEmbeddings | albert_indic | xx |
DistilBertEmbeddings | distilbert_base_cased | vi |
BertForSequenceClassification | bert_sequence_classifier_news_sentiment | de |
BertForSequenceClassification | bert_sequence_classifier_emotion | en |
DistilBertForTokenClassification | distilbert_token_classifier_typo_detector | en |
DistilBertForTokenClassification | distilbert_base_token_classifier_masakhaner | xx |
WordEmbeddingsModel | word2vec_wiki_1000 | fr |
WordEmbeddingsModel | word2vec_wac_200 | fr |
WordEmbeddingsModel | w2v_cc_300d | fr |
Documentation
- TF Hub & HuggingFace to Spark NLP
- Models Hub with new models
- Spark NLP documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
- Spark NLP Workshop notebooks
- Spark NLP publications
- Spark NLP in Action
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP Display for visualization of different types of annotations
- Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
Installation
Python
#PyPI
pip install spark-nlp==3.4.1
Spark Packages
spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.1
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.1
spark-nlp on Apache Spark 3.2.x (Scala 2.12 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.1
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.1
spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.1
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.1
spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.1
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.1
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.1
Maven
spark-nlp on Apache Spark 3.0.x and 3.1.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>3.4.1</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>3.4.1</version>
</dependency>
spark-nlp on Apache Spark 3.2.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark32_2.12</artifactId>
<version>3.4.1</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark32_2.12</artifactId>
<version>3.4.1</version>
</dependency>
spark-nlp on Apache Spark 2.4.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark24_2.11</artifactId>
<version>3.4.1</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
<version>3.4.1</version>
</dependency>
spark-nlp on Apache Spark 2.3.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark23_2.11</artifactId>
<version>3.4.1</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark23_2.11...
John Snow Labs Spark-NLP 3.4.0: New OpenAI GPT-2, new ALBERT, XLNet, RoBERTa, XLM-RoBERTa, and Longformer for Sequence Classification, support for Spark 3.2, new distributed Word2Vec, extend support to more Databricks & EMR runtimes, new state-of-the-art transformer models, bug fixes, and lots more!
Overview
We are very excited to release Spark NLP 3.4.0! This has been one of the biggest releases we have ever done and we are so proud to share this with our community at the dawn of 2022!
Spark NLP 3.4.0 extends the support for Apache Spark 3.2.x major releases on Scala 2.12. We now support all 5 major Apache Spark and PySpark releases of 2.3.x, 2.4.x, 3.0.x, 3.1.x, and 3.2.x at once helping our community to migrate from earlier Apache Spark versions to newer releases without being worried about Spark NLP end of life support. We also extend support for new Databricks and EMR instances on Spark 3.2.x clusters.
This release also comes with a brand new GPT2Transformer using OpenAI GPT-2 models for prediction at scale, new ALBERT, XLNet, RoBERTa, XLM-RoBERTa, and Longformer annotators to use existing or fine-tuned models for Sequence Classification, new distributed and trainable Word2Vec annotators, new state-of-the-art transformer models in many languages, a new param to useBestModel in NerDL during training, bug fixes, and lots more!
As always, we would like to thank our community for their feedback, questions, and feature requests.
Major features and improvements
- NEW: Introducing GPT2Transformer annotator in Spark NLP 🚀 for Text Generation purposes. `GPT2Transformer` uses OpenAI GPT-2 models from HuggingFace 🤗 for prediction at scale in Spark NLP 🚀. `GPT-2` is a transformer model trained on a very large corpus of English data in a self-supervised fashion. This means it was trained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences
- NEW: Introducing RoBertaForSequenceClassification annotator in Spark NLP 🚀. `RoBertaForSequenceClassification` can load RoBERTa Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `RobertaForSequenceClassification` for PyTorch or `TFRobertaForSequenceClassification` for TensorFlow models in HuggingFace 🤗
- NEW: Introducing XlmRoBertaForSequenceClassification annotator in Spark NLP 🚀. `XlmRoBertaForSequenceClassification` can load XLM-RoBERTa Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `XLMRobertaForSequenceClassification` for PyTorch or `TFXLMRobertaForSequenceClassification` for TensorFlow models in HuggingFace 🤗
- NEW: Introducing LongformerForSequenceClassification annotator in Spark NLP 🚀. `LongformerForSequenceClassification` can load Longformer Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `LongformerForSequenceClassification` for PyTorch or `TFLongformerForSequenceClassification` for TensorFlow models in HuggingFace 🤗
- NEW: Introducing AlbertForSequenceClassification annotator in Spark NLP 🚀. `AlbertForSequenceClassification` can load ALBERT Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `AlbertForSequenceClassification` for PyTorch or `TFAlbertForSequenceClassification` for TensorFlow models in HuggingFace 🤗
- NEW: Introducing XlnetForSequenceClassification annotator in Spark NLP 🚀. `XlnetForSequenceClassification` can load XLNet Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `XLNetForSequenceClassification` for PyTorch or `TFXLNetForSequenceClassification` for TensorFlow models in HuggingFace 🤗
- NEW: Introducing trainable and distributed Word2Vec annotators based on Word2Vec in Spark ML. You can train Word2Vec in a cluster on multiple machines to handle large-scale datasets and use the trained model for token-level classifications such as NerDL
- Introducing `useBestModel` param in NerDLApproach annotator. This param preserves and restores the model that has achieved the best performance at the end of the training. The priority is metrics from testDataset (micro F1), metrics from validationSplit (micro F1), and if none is set it will keep track of loss during the training
- Support Apache Spark and PySpark 3.2.x on Scala 2.12. Spark NLP by default is shipped for Spark 3.0.x/3.1.x, but now you have `spark-nlp-spark32` and `spark-nlp-gpu-spark32` packages
- Adding a new param to sparknlp.start() function in Python for Apache Spark 3.2.x (`spark32=True`)
- Update Colab and Kaggle scripts for faster setup. We no longer need to remove Java 11 in order to install Java 8 since Spark NLP works on Java 11. This makes the installation of Spark NLP on Colab and Kaggle as fast as `pip install spark-nlp pyspark==3.1.2`
- Add new scripts/notebook to generate custom TensorFlow graphs for `ContextSpellCheckerApproach` annotator
- Add a new `graphFolder` param to `ContextSpellCheckerApproach` annotator. This param allows training ContextSpellChecker from a custom-made TensorFlow graph
- Support DBFS file system in `graphFolder` param. Starting with Spark NLP 3.4.0 you can point NerDLApproach or ContextSpellCheckerApproach to a TF graph hosted on Databricks
- Add a new feature to all classifiers (`ForTokenClassification` and `ForSequenceClassification`) to retrieve classes from the pretrained models
sequenceClassifier = XlmRoBertaForSequenceClassification \
.pretrained('xlm_roberta_base_sequence_classifier_ag_news', 'en') \
.setInputCols(['token', 'document']) \
.setOutputCol('class')
print(sequenceClassifier.getClasses())
#Sports, Business, World, Sci/Tech
- Add `inputFormats` param to DateMatcher and MultiDateMatcher annotators. DateMatcher and MultiDateMatcher can now define a list of acceptable input formats via date patterns to search in the text. The output format then defines the single pattern used to render every matched date.
# setOutputFormat was previously called setDateFormat
date_matcher = DateMatcher() \
.setInputCols(['document']) \
.setOutputCol("date") \
.setInputFormats(["yyyy", "yyyy/dd/MM", "MM/yyyy"]) \
.setOutputFormat("yyyyMM") \
.setSourceLanguage("en")
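To make the relationship between several input formats and one output format concrete, here is a minimal standalone sketch. It does not use Spark NLP and is not how DateMatcher is implemented (DateMatcher searches for dates inside running text); it only parses a whole string with Python's `datetime`, using `strptime` directives in place of the Java-style patterns above. The function name `normalize_date` is hypothetical.

```python
from datetime import datetime
from typing import Optional, Sequence


def normalize_date(text: str, input_formats: Sequence[str],
                   output_format: str) -> Optional[str]:
    """Try each accepted input format in order; render the first match
    with the single output format. Returns None if nothing matches."""
    for fmt in input_formats:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format does not fit, try the next one
        return parsed.strftime(output_format)
    return None


# strptime equivalents of the patterns "yyyy/dd/MM", "MM/yyyy" and "yyyy"
formats = ["%Y/%d/%m", "%m/%Y", "%Y"]
for raw in ["2021/31/12", "03/2020", "2019", "not a date"]:
    print(raw, "->", normalize_date(raw, formats, "%Y%m"))
```

Missing fields fall back to the parser's defaults, so a year-only input is rendered with month `01` in this sketch.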
- Enable batch processing in T5Transformer and MarianTransformer annotators
- Add Schema to `readDataset` in CoNLL() class
- Welcoming 6x new Databricks runtimes to our Spark NLP family:
- Databricks 10.0
- Databricks 10.0 ML GPU
- Databricks 10.1
- Databricks 10.1 ML GPU
- Databricks 10.2
- Databricks 10.2 ML GPU
- Welcoming 3x new EMR releases to our Spark NLP family:
- EMR 5.33.1 (Apache Spark 2.4.7 / Hadoop 2.10.1)
- EMR 6.3.1 (Apache Spark 3.1.1 / Hadoop 3.2.1)
- EMR 6.4.0 (Apache Spark 3.1.2 / Hadoop 3.2.1)
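The `useBestModel` priority described in the list above (test-set micro F1 first, then validation-split micro F1, then training loss) can be sketched as plain Python. This is an illustration of the selection rule only, under the assumption that F1 is maximized and loss is minimized; the function names and the epoch history are hypothetical and this is not NerDLApproach's code.

```python
from typing import List, Optional, Tuple

# One entry per epoch: (test micro F1, validation micro F1, training loss)
Epoch = Tuple[Optional[float], Optional[float], float]


def metric_to_track(test_f1: Optional[float],
                    validation_f1: Optional[float],
                    loss: float) -> Tuple[str, float]:
    """Pick the tracked metric using the priority described above."""
    if test_f1 is not None:
        return "test_micro_f1", test_f1
    if validation_f1 is not None:
        return "validation_micro_f1", validation_f1
    return "loss", loss


def best_epoch(history: List[Epoch]) -> int:
    """Return the index of the epoch whose model would be kept."""
    best_index, best_score = 0, None
    for index, (test_f1, validation_f1, loss) in enumerate(history):
        name, value = metric_to_track(test_f1, validation_f1, loss)
        # Negate the loss so that a higher score is always better.
        score = -value if name == "loss" else value
        if best_score is None or score > best_score:
            best_index, best_score = index, score
    return best_index


# Made-up history with only a validation split configured.
history = [(None, 0.81, 0.40), (None, 0.86, 0.31), (None, 0.84, 0.25)]
print("best epoch:", best_epoch(history))
```

Note that the last epoch has the lowest loss but is not selected, because validation F1 takes priority over loss when it is available.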
Bug Fixes
- Fix a race condition in cluster mode where, on the very first access, the TF session was loaded as many times as the number of available cores on the Driver machine. Loading a model multiple times at once results in higher disk usage, and IO may become a bottleneck for larger models, especially on a machine with slower disks. Thanks to @jerrychenhf for finding this issue and offering a solution #6575
- Fix a performance issue introduced in the 3.3.3 release for T5Transformer and MarianTransformer annotators. While we added support for ignored tokens, accidentally we introduced a bug that degraded the performance for these two annotators (sometimes up to 2x slower). Please update to 3.4.0 if you are using any of these two annotators #6605
- Fix a bug in model resolution by not filtering based on the timestamp
- Fix configProtoBytes param type in Python #6549
- Fix missing DefaultParamsReadable in RegexTokenizer annotator #6653
- Fix missing models `lemma_antbnc`, `sentiment_vivekn`, and `spellcheck_norvig` for Spark 3.x
- Fix missing pipelines `clean_slang`, `check_spelling`, `match_chunks`, and `match_datetime` for Spark 3.x
- Fix `saveModel` in TrainingHelper
- Fix Keyword/Yake module naming in Scala #6562
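The race condition in the first fix above is a general pattern: several workers ask for the same lazily loaded model at once and each one loads it. A common remedy is to guard the load with a lock and re-check inside it. The sketch below shows that generic pattern in plain Python; it is not the change made in #6575, and `ModelHolder` and `expensive_load` are hypothetical names.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ModelHolder:
    """Loads a model at most once, even when many threads ask for it at the same time."""

    def __init__(self, loader):
        self._loader = loader
        self._lock = threading.Lock()
        self._model = None
        self.load_count = 0

    def get(self):
        if self._model is None:  # fast path once the model is loaded
            with self._lock:
                if self._model is None:  # re-check: another thread may have loaded it
                    self._model = self._loader()
                    self.load_count += 1
        return self._model


def expensive_load():
    time.sleep(0.05)  # simulated disk IO for a large model
    return object()


holder = ModelHolder(expensive_load)
with ThreadPoolExecutor(max_workers=8) as pool:
    models = list(pool.map(lambda _: holder.get(), range(8)))

print("loads performed:", holder.load_count)
```

All eight callers receive the same object, and the slow load runs a single time instead of once per thread.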
Models Hub
Models Hub now comes with new features to easily filter and find your desired models & pipelines by:
- NLP Task
- Natural Language
- Spark NLP version
In addition, you can also filter models & pipelines by:
- Models or Pipelines (finally!)
- Tags used inside Model's card
- Or even by predicted entities (which labels/classes a model can predict)
As always, you can host your own pre-trained models & pipelines easily accessible to you for free & forever!
Models and Pipelines
--------------...
John Snow Labs Spark-NLP 3.3.4: Patch release
Patch release
- Fix `ClassCastException` error in pretrained function for DistilBertForSequenceClassification in Python #6513
Documentation
- TF Hub & HuggingFace to Spark NLP
- Models Hub with new models
- Spark NLP publications
- Spark NLP in Action
- Spark NLP documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
- Spark NLP Workshop notebooks
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP Display for visualization of different types of annotations
- Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
Installation
Python
#PyPI
pip install spark-nlp==3.3.4
Spark Packages
spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.4
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.4
spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.4
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.4
spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.4
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.4
Maven
spark-nlp on Apache Spark 3.0.x and 3.1.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>3.3.4</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>3.3.4</version>
</dependency>
spark-nlp on Apache Spark 2.4.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark24_2.11</artifactId>
<version>3.3.4</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
<version>3.3.4</version>
</dependency>
spark-nlp on Apache Spark 2.3.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark23_2.11</artifactId>
<version>3.3.4</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
<version>3.3.4</version>
</dependency>
FAT JARs
- CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.3.4.jar
- GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.3.4.jar
- CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.3.4.jar
- GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.3.4.jar
- CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.3.4.jar
- GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.3.4.jar
What's Changed
- Update documentation of ChunkKeyPhraseExtraction by @vankov in #6508
- Fixes `new` instantiation in scala section by @josejuanmartinez in #6469
- Fix the wrong name for DistilBertForSequenceClassification in Python by @maziyarpanahi in #6513
- Release/334 release candidate by @maziyarpanahi in #6514
Full Changelog: 3.3.3...3.3.4
John Snow Labs Spark-NLP 3.3.3: New DistilBERT for Sequence Classification, new trainable and distributed Doc2Vec, BERT improvements on GPU, new state-of-the-art DistilBERT models for topic and sentiment detection, enhancements, and bug fixes!
Overview
(knock, knock, knock) Penny? Yes, this is a very special release if you are obsessed with the number 3 as much as we are! So we are pleased to announce Spark NLP 🚀 3.3.3 release!
This release comes with a new DistilBertForSequenceClassification annotator for existing or fine-tuned DistilBERT models for Text Classification on HuggingFace, new distributed and trainable Doc2Vec annotator based on Word2Vec implementation in Spark ML, improving BertEmbeddings and BertSentenceEmbeddings on a single machine on a GPU device where the DataFrame has 1 sentence per row or input column is set to document, new state-of-the-art fine-tuned DistilBERT models for Sequence Classification, enhancements, bug fixes, and more!
As always, we would like to thank our community for their feedback, questions, and feature requests.
New Features and Enhancements
- NEW: Introducing DistilBertForSequenceClassification annotator in Spark NLP 🚀. `DistilBertForSequenceClassification` can load DistilBERT Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `DistilBertForSequenceClassification` or `TFDistilBertForSequenceClassification` in HuggingFace 🤗
- NEW: Introducing trainable and distributed Doc2Vec annotators based on Word2Vec in Spark ML
- Improving BertEmbeddings for single document/sentence DataFrame per row on a single machine with a GPU device
- Improving BertSentenceEmbeddings for single document/sentence DataFrame per row on a single machine with a GPU device
- Add a new feature to the CoNLL() class, allowing it to read multiple CoNLL files at the same time into a single DataFrame
- Add support for Long type in label column for ClassifierDLApproach and SentimentDLApproach
- Add script to setup AWS SageMaker thanks to @xegulon
- Add instructions to setup Amazon Linux 2
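As background for the Doc2Vec annotators mentioned above: Spark ML's Word2VecModel turns a document into a vector by averaging the vectors of its words. The sketch below shows that averaging step in plain Python, under the assumption that the new Doc2Vec annotators follow the same approach. The function name and the toy two-dimensional vectors are hypothetical.

```python
from typing import Dict, List, Sequence


def document_vector(tokens: Sequence[str],
                    word_vectors: Dict[str, List[float]],
                    dim: int) -> List[float]:
    """Average the vectors of the known tokens; unknown tokens are skipped."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim  # no known token: fall back to a zero vector
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy 2-dimensional embeddings, made up for illustration.
toy_vectors = {"spark": [1.0, 0.0], "nlp": [0.0, 1.0]}
print(document_vector(["spark", "nlp", "rocks"], toy_vectors, dim=2))
```

The resulting fixed-size vector can then be fed to a document-level classifier, which is the use case shown in the "Train Doc2Vec for Text Classification" notebook.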
Bug Fixes
- Improve model and pipeline resolution in Spark NLP, which used to download the wrong models/pipelines regardless of their Apache Spark version
- Fix MarianTransformer bug on empty sequences
- Fix TFInvalidArgumentException in MarianTransformer for sequences longer than 512
- Fix MarianTransformer multi-lingual models and pipelines such as `opus_mt_mul_en` and `opus_mt_mul_en`
- Fix a bug in DateMatcher and MultiDateMatcher when detecting month from subwords by mistake
- Add the missing `lemma_antbnc` model to Models Hub
- Add the missing `sentiment_vivekn` model to Models Hub
- Add the missing `spellcheck_norvig` model to Models Hub
Models
New state-of-the-art fine-tuned DistilBERT models for Sequence Classification:
Featured Pretrained Models
Model | Name | Lang | Build |
---|---|---|---|
DistilBertForSequenceClassification | distilbert_sequence_classifier_sst2 | en | 3.3.3 |
DistilBertForSequenceClassification | distilbert_sequence_classifier_policy | en | 3.3.3 |
DistilBertForSequenceClassification | distilbert_sequence_classifier_industry | en | 3.3.3 |
DistilBertForSequenceClassification | distilbert_sequence_classifier_emotion | en | 3.3.3 |
DistilBertForSequenceClassification | distilbert_sequence_classifier_banking77 | en | 3.3.3 |
DistilBertForSequenceClassification | distilbert_multilingual_sequence_classifier_allocine | fr | 3.3.3 |
DistilBertForSequenceClassification | distilbert_base_sequence_classifier_imdb | ur | 3.3.3 |
DistilBertForSequenceClassification | distilbert_base_sequence_classifier_imdb | en | 3.3.3 |
DistilBertForSequenceClassification | distilbert_base_sequence_classifier_amazon_polarity | en | 3.3.3 |
DistilBertForSequenceClassification | distilbert_base_sequence_classifier_ag_news | en | 3.3.3 |
Doc2VecModel | doc2vec_gigaword_300 | en | 3.3.3 |
Doc2VecModel | doc2vec_gigaword_wiki_300 | en | 3.3.3 |
The complete list of all 4000+ models & pipelines in 200+ languages is available on Models Hub.
New Notebooks
Spark NLP | Notebooks | Colab |
---|---|---|
DistilBertForSequenceClassification | HuggingFace in Spark NLP - DistilBertForSequenceClassification | |
Doc2Vec | Train Doc2Vec for Text Classification |
Documentation
- TF Hub & HuggingFace to Spark NLP
- Models Hub with new models
- Spark NLP documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
- Spark NLP Workshop notebooks
- Spark NLP publications
- Spark NLP in Action
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP Display for visualization of different types of annotations
- Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
Installation
Python
#PyPI
pip install spark-nlp==3.3.3
Spark Packages
spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.3
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.3
spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.3
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.3
spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.3
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.3
Maven
spark-nlp on Apache Spark 3.0.x and 3.1.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>3.3.3</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>3.3.3</version>
</dependency>
spark-nlp on Apache Spark 2.4.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark24_2.11</artifactId>
<version>3.3.3</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
<version>3.3.3</version>
</dependency>
spark-nlp on Apache Spark 2.3.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark23_2.11</artifactId>
<version>3....