Release Spark NLP 3.4.4: New DeBERTa for Token Classification, new CamemBERT embeddings, speed improvements for Tokenizer and UniversalSentenceEncoder annotators, over 160 new state-of-the-art models, and other improvements! · JohnSnowLabs/spark-nlp

Overview

We are very excited to release Spark NLP 🚀 3.4.4! This release comes with a new DeBERTa for Token Classification annotator compatible with existing or fine-tuned models on HuggingFace 🤗, a new annotator for CamemBERT embeddings models, up to 18x faster UniversalSentenceEncoder on GPU devices, up to 400% faster Tokenizer when a list of exceptions is used, and new state-of-the-art NER, French embeddings, DistilBERT embeddings, and ALBERT embeddings models!

As always, we would like to thank our community for their feedback, questions, and feature requests.

New Features

NEW: Introducing DeBertaForTokenClassification annotator in Spark NLP 🚀. DeBertaForTokenClassification can load DeBERTa v2 and v3 models with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using DebertaV2ForTokenClassification for PyTorch or TFDebertaV2ForTokenClassification for TensorFlow models in HuggingFace. #8082
NEW: Introducing CamemBertEmbeddings annotator in Spark NLP 🚀. CamemBERT is a state-of-the-art language model for French based on the RoBERTa architecture, pretrained on the French subcorpus of the newly available multilingual corpus OSCAR. For further information or requests, please go to the CamemBERT website. #8237
Add support for batching rows to improve UniversalSentenceEncoder on GPU devices. This new feature makes GPU inference 2x to 18x faster, depending on the distribution of sentences #8234
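As an illustration of the first feature above, here is a minimal sketch of a pipeline that uses DeBertaForTokenClassification from Python. It assumes `spark-nlp==3.4.4` and a compatible PySpark are installed; the column names (`text`, `document`, `token`, `ner`) and the helper name `build_ner_pipeline` are choices made for this example, and the default model name is taken from the models table in these notes.

```python
# Sketch only: the helper name and column names are illustrative assumptions.
NER_MODEL = "deberta_v3_base_token_classifier_conll03"  # listed in the Models section


def build_ner_pipeline(model_name=NER_MODEL, lang="en"):
    """Return an unfitted Spark ML Pipeline: text -> document -> token -> ner."""
    # Imports live inside the function so the module can be loaded
    # without starting Spark or downloading any model.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, DeBertaForTokenClassification

    # Wrap the raw text column into Spark NLP's document annotation.
    document = DocumentAssembler().setInputCol("text").setOutputCol("document")

    # Split each document into tokens.
    token = Tokenizer().setInputCols(["document"]).setOutputCol("token")

    # Download the pretrained model and label every token.
    ner = (
        DeBertaForTokenClassification.pretrained(model_name, lang)
        .setInputCols(["document", "token"])
        .setOutputCol("ner")
    )

    return Pipeline(stages=[document, token, ner])
```

To try it, start a session with `sparknlp.start()`, create a DataFrame with a `text` column, and call `build_ner_pipeline().fit(df).transform(df).select("ner.result")`. The first call downloads the model, so it needs network access.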

Bug Fixes & Enhancements

Optimize Tokenizer performance by up to 400% when there is an exceptions list. The exceptions list now scales to a large number of exceptions without impacting the overall performance #7881
Support latest PySpark releases in Colab, Kaggle, and SageMaker scripts #8028
Fix bug that caused get input/output/LazyAnnotator to return None #8043
Fix DeBertaForSequenceClassification in Python failing to load pretrained models #8060
Fix missing Lemma and POS models from 3.4.3 release
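For readers unfamiliar with the Tokenizer exceptions list mentioned in the first item above, the following sketch shows one way to configure it from Python. It assumes `spark-nlp` is installed; the helper name `build_tokenizer` and the sample exceptions are made up for this example, while `setExceptions` is the Tokenizer parameter setter for strings that should be kept as single tokens.

```python
# Sketch only: helper name and sample exceptions are illustrative assumptions.
EXAMPLE_EXCEPTIONS = ["New York", "e-mail", "Spark NLP"]


def build_tokenizer(exceptions=EXAMPLE_EXCEPTIONS):
    """Return a Tokenizer that keeps each exception string as a single token."""
    # Imported lazily so the module loads without a Spark installation.
    from sparknlp.annotator import Tokenizer

    return (
        Tokenizer()
        .setInputCols(["document"])
        .setOutputCol("token")
        .setExceptions(list(exceptions))
    )
```

The returned annotator can be placed after a DocumentAssembler in a pipeline. Because it calls into the JVM, an active Spark session (for example from `sparknlp.start()`) is required before `build_tokenizer()` is invoked.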

Dependencies

Removing outdated trove4j dependency in favour of native Java modules #8236
Upgrade the base Apache Spark to 2.4.8, 3.0.3, and 3.2.1
Upgrade Typesafe Config to 1.4.2
Upgrade sbt to 1.6.2

Models

Spark NLP 3.4.4 comes with over 160 state-of-the-art multilingual pretrained models. Some of the featured models:

New DeBERTa Token Classification Models

New fine-tuned DeBERTa v3 models for token classification over CoNLL03 and OntoNotes datasets that reach state-of-the-art metrics.

Model	Name	Lang	F1 Dev
DeBertaForTokenClassification	deberta_v3_large_token_classifier_conll03	`en`	`0.97`
DeBertaForTokenClassification	deberta_v3_base_token_classifier_conll03	`en`	`0.96`
DeBertaForTokenClassification	deberta_v3_small_token_classifier_conll03	`en`	`0.95`
DeBertaForTokenClassification	deberta_v3_xsmall_token_classifier_conll03	`en`	`0.93`
DeBertaForTokenClassification	deberta_v3_large_token_classifier_ontonotes	`en`	`0.89`
DeBertaForTokenClassification	deberta_v3_base_token_classifier_ontonotes	`en`	`0.88`
DeBertaForTokenClassification	deberta_v3_small_token_classifier_ontonotes	`en`	`0.87`
DeBertaForTokenClassification	deberta_v3_xsmall_token_classifier_ontonotes	`en`	`0.86`

New CamemBERT Models

Model	Name	Lang
CamemBertEmbeddings	camembert_large	`fr`
CamemBertEmbeddings	camembert_base	`fr`
CamemBertEmbeddings	camembert_base_ccnet_4gb	`fr`
CamemBertEmbeddings	camembert_base_ccnet	`fr`
CamemBertEmbeddings	camembert_base_oscar_4gb	`fr`
CamemBertEmbeddings	camembert_base_wikipedia_4gb	`fr`
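Any of the CamemBERT models in the table above can be loaded by name. The sketch below assumes `spark-nlp==3.4.4` is installed; the helper name `build_camembert_pipeline` and the column names are assumptions made for this example, not part of the release.

```python
# Sketch only: helper name and column names are illustrative assumptions.
CAMEMBERT_MODEL = "camembert_base"  # any name from the table above should work


def build_camembert_pipeline(model_name=CAMEMBERT_MODEL):
    """Return an unfitted Pipeline producing French token embeddings."""
    # Imported lazily so the module loads without starting Spark.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, CamemBertEmbeddings

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    token = Tokenizer().setInputCols(["document"]).setOutputCol("token")

    # Download the pretrained French model and embed each token.
    embeddings = (
        CamemBertEmbeddings.pretrained(model_name, "fr")
        .setInputCols(["document", "token"])
        .setOutputCol("embeddings")
    )

    return Pipeline(stages=[document, token, embeddings])
```

After fitting and transforming a DataFrame with a French `text` column, the vectors are available under `embeddings.embeddings`.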

New DistilBERT Embeddings Models

Model	Name	Lang
DistilBertEmbeddings	distilbert_embeddings_distilbert_base_fr_cased	`fr`
DistilBertEmbeddings	distilbert_embeddings_marathi_distilbert	`mr`
DistilBertEmbeddings	distilbert_embeddings_distilbert_base_indonesian	`id`
DistilBertEmbeddings	distilbert_embeddings_javanese_distilbert_small	`jv`
DistilBertEmbeddings	distilbert_embeddings_malaysian_distilbert_small	`ms`
DistilBertEmbeddings	distilbert_embeddings_distilbert_base_ar_cased	`ar`

New ALBERT Embeddings Models

Model	Name	Lang
AlbertEmbeddings	albert_embeddings_fralbert_base	`fr`
AlbertEmbeddings	albert_embeddings_albert_base_arabic	`ar`
AlbertEmbeddings	albert_embeddings_marathi_albert_v2	`mr`
AlbertEmbeddings	albert_embeddings_albert_fa_base_v2	`fa`
AlbertEmbeddings	albert_embeddings_albert_large_bahasa_cased	`ms`
AlbertEmbeddings	albert_embeddings_marathi_albert	`mr`

The complete list of all 5000+ models & pipelines in 200+ languages is available on Models Hub.

New Notebooks

Import CamemBERT models to Spark NLP 🚀

Spark NLP	HuggingFace Notebooks	Colab
CamemBertEmbeddings	HuggingFace in Spark NLP - CamemBERT

You can visit Import Transformers in Spark NLP for more info.

Documentation

TF Hub & HuggingFace to Spark NLP
Models Hub with new models
Spark NLP documentation
Spark NLP Scala APIs
Spark NLP Python APIs
Spark NLP Workshop notebooks
Spark NLP publications
Spark NLP in Action
Spark NLP training certification notebooks for Google Colab and Databricks
Spark NLP Display for visualization of different types of annotations
Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP!

Installation

Python

# PyPI

pip install spark-nlp==3.4.4

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.4

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.4

spark-nlp on Apache Spark 3.2.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.4

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.4

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.4

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.4

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.4

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.4
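The package names above follow a regular pattern: an optional `-gpu` part, an optional Spark-version suffix, and the Scala version. As a convenience, the mapping can be sketched as a small helper. The function name `spark_nlp_coordinate` is hypothetical, and the table encodes only the combinations listed in these notes.

```python
# Hypothetical helper: builds the --packages coordinate for a Spark version.
# The mapping covers only the Spark/Scala combinations listed in these notes.
_SPARK_TO_SUFFIX_AND_SCALA = {
    "3.0": ("", "2.12"),
    "3.1": ("", "2.12"),
    "3.2": ("-spark32", "2.12"),
    "2.4": ("-spark24", "2.11"),
    "2.3": ("-spark23", "2.11"),
}


def spark_nlp_coordinate(spark_version, gpu=False, nlp_version="3.4.4"):
    """Return e.g. 'com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.4'."""
    # Reduce '3.2.1' to '3.2' before the lookup.
    major_minor = ".".join(spark_version.split(".")[:2])
    if major_minor not in _SPARK_TO_SUFFIX_AND_SCALA:
        raise ValueError(f"Spark {spark_version} is not listed for Spark NLP {nlp_version}")
    suffix, scala = _SPARK_TO_SUFFIX_AND_SCALA[major_minor]
    gpu_part = "-gpu" if gpu else ""
    return f"com.johnsnowlabs.nlp:spark-nlp{gpu_part}{suffix}_{scala}:{nlp_version}"
```

For example, `spark_nlp_coordinate("3.2.1", gpu=True)` yields the GPU coordinate for Apache Spark 3.2.x, which can then be passed to `spark-shell --packages` or `pyspark --packages`.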

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>3.4.4</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>3.4.4</version>
</dependency>

spark-nlp on Apache Spark 3.2.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark32_2.12</artifactId>
    <version>3.4.4</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark32_2.12</artifactId>
    <version>3.4.4</version>
</dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark24_2.11</artifactId>
    <version>3.4.4</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
    <version>3.4.4</version>
</dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark23_2.11</artifactId>
    <version>3.4.4</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
    <version>3.4.4</version>
</dependency>

FAT JARs

CPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.4.4.jar
GPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.4.4.jar
CPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark32-assembly-3.4.4.jar
GPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark32-assembly-3.4.4.jar
CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.4.4.jar
GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.4.4.jar
CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.4.4.jar
GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.4.4.jar

What's Changed

Full Changelog: 3.4.3...3.4.4

New Contributors

@aymanechilah made their first contribution in #6956

@xusliebana @Ahmetemintek @jsl-models @Meryem1425 @mahmoodbayeshi @aymanechilah @DevinTDHa @agsfer @rpranab @C-K-Loan @maziyarpanahi @Damla-Gurbaz @danilojsl @luca-martial @muhammetsnts @josejuanmartinez @bunyamin-polat @galiph @jsl-builder @albertoandreottiATgmail
