
Spark NLP 3.4.4: New DeBERTa for Token Classification, new CamemBERT embeddings, speed improvements for Tokenizer and UniversalSentenceEncoder annotators, over 160 new state-of-the-art models, and other improvements!

@maziyarpanahi maziyarpanahi released this 06 May 13:49
· 1809 commits to master since this release

Overview

We are very excited to release Spark NLP 🚀 3.4.4! This release comes with a new DeBERTa for Token Classification annotator compatible with existing or fine-tuned models on HuggingFace 🤗, a new annotator for CamemBERT embeddings models, up to 18x faster UniversalSentenceEncoder on GPU devices, up to 400% speed improvement in Tokenizer when a list of exceptions is used, and new state-of-the-art NER, French embeddings, DistilBERT embeddings, and ALBERT embeddings models!

As always, we would like to thank our community for their feedback, questions, and feature requests.


New Features

  • NEW: Introducing DeBertaForTokenClassification annotator in Spark NLP 🚀. DeBertaForTokenClassification can load DeBERTa v2 and v3 models with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all models trained or fine-tuned using DebertaV2ForTokenClassification (PyTorch) or TFDebertaV2ForTokenClassification (TensorFlow) in HuggingFace #8082
  • NEW: Introducing CamemBertEmbeddings annotator in Spark NLP 🚀. CamemBERT is a state-of-the-art language model for French, based on the RoBERTa architecture and pretrained on the French subcorpus of the multilingual corpus OSCAR. For further information or requests, see the CamemBERT website #8237
  • Add support for batching rows to improve UniversalSentenceEncoder on GPU devices. This new feature increases GPU speed by 2x to 18x, depending on the distribution of sentences #8234
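As a rough sketch of how the new token classification annotator might be wired into a pipeline from Python: the model name is one of the DeBERTa models listed in this release, while the column names and the helpers `build_ner_pipeline` and `pair_tokens_with_tags` are illustrative assumptions rather than part of Spark NLP. Building the pipeline requires `spark-nlp==3.4.4` and PySpark to be installed.

```python
MODEL_NAME = "deberta_v3_base_token_classifier_conll03"  # listed in this release
LANG = "en"


def build_ner_pipeline():
    """Assemble document -> token -> DeBERTa NER stages (hypothetical helper)."""
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, DeBertaForTokenClassification

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    ner = (
        DeBertaForTokenClassification.pretrained(MODEL_NAME, LANG)
        .setInputCols(["document", "token"])
        .setOutputCol("ner")
    )
    return Pipeline(stages=[document, tokenizer, ner])


def pair_tokens_with_tags(tokens, tags):
    """Zip one row's `token.result` and `ner.result` arrays into (token, tag) pairs.

    Assumption: the annotator emits exactly one tag per token.
    """
    if len(tokens) != len(tags):
        raise ValueError("expected one tag per token")
    return list(zip(tokens, tags))
```

With a session from `sparknlp.start()` and a DataFrame `data` that has a `text` column, `build_ner_pipeline().fit(data).transform(data)` should add a `ner` column whose `result` field can be passed to `pair_tokens_with_tags` together with `token.result`.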

Bug Fixes & Enhancements

  • Optimize Tokenizer performance by up to 400% when an exceptions list is used. The exceptions list now scales to a large number of exceptions without impacting overall performance #7881
  • Support latest PySpark releases in Colab, Kaggle, and SageMaker scripts #8028
  • Fix bug that caused get input/output/LazyAnnotator to return None #8043
  • Fix DeBertaForSequenceClassification in Python failing to load pretrained models #8060
  • Fix missing Lemma and POS models from 3.4.3 release
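For readers unfamiliar with the Tokenizer exceptions list mentioned above, here is a minimal sketch of how one might be supplied from Python. `clean_exceptions` and `build_tokenizer` are hypothetical helpers, and the column names are assumptions; only `Tokenizer` and its `setExceptions` setter come from Spark NLP, which must be installed to call `build_tokenizer`.

```python
def clean_exceptions(raw):
    """Strip whitespace, drop empty entries, and de-duplicate while keeping order."""
    seen = set()
    cleaned = []
    for item in raw:
        item = item.strip()
        if item and item not in seen:
            seen.add(item)
            cleaned.append(item)
    return cleaned


def build_tokenizer(exceptions):
    """Create a Tokenizer that keeps each exception as a single token."""
    # Imported lazily so this module can be loaded without Spark NLP installed.
    from sparknlp.annotator import Tokenizer

    return (
        Tokenizer()
        .setInputCols(["document"])
        .setOutputCol("token")
        .setExceptions(clean_exceptions(exceptions))
    )
```

For example, `build_tokenizer(["New York", "e-mail"])` would keep "New York" as one token instead of splitting it on whitespace.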

Dependencies

  • Remove outdated trove4j dependency in favour of native Java modules #8236
  • Upgrade the base Apache Spark to 2.4.8, 3.0.3, and 3.2.1
  • Upgrade Typesafe Config to 1.4.2
  • Upgrade sbt to 1.6.2

Models

Spark NLP 3.4.4 comes with over 160 state-of-the-art multi-lingual pretrained models. Some of the featured models:

New DeBERTa Token Classification Models

New fine-tuned DeBERTa v3 models for token classification over the CoNLL03 and OntoNotes datasets that reach state-of-the-art metrics.

| Model | Name | Lang | F1 Dev |
|---|---|---|---|
| DeBertaForTokenClassification | deberta_v3_large_token_classifier_conll03 | en | 0.97 |
| DeBertaForTokenClassification | deberta_v3_base_token_classifier_conll03 | en | 0.96 |
| DeBertaForTokenClassification | deberta_v3_small_token_classifier_conll03 | en | 0.95 |
| DeBertaForTokenClassification | deberta_v3_xsmall_token_classifier_conll03 | en | 0.93 |
| DeBertaForTokenClassification | deberta_v3_large_token_classifier_ontonotes | en | 0.89 |
| DeBertaForTokenClassification | deberta_v3_base_token_classifier_ontonotes | en | 0.88 |
| DeBertaForTokenClassification | deberta_v3_small_token_classifier_ontonotes | en | 0.87 |
| DeBertaForTokenClassification | deberta_v3_xsmall_token_classifier_ontonotes | en | 0.86 |

New CamemBERT Models

| Model | Name | Lang |
|---|---|---|
| CamemBertEmbeddings | camembert_large | fr |
| CamemBertEmbeddings | camembert_base | fr |
| CamemBertEmbeddings | camembert_base_ccnet_4gb | fr |
| CamemBertEmbeddings | camembert_base_ccnet | fr |
| CamemBertEmbeddings | camembert_base_oscar_4gb | fr |
| CamemBertEmbeddings | camembert_base_wikipedia_4gb | fr |
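A minimal sketch of loading one of these models from Python, assuming `spark-nlp==3.4.4` is installed. The model names are copied from the list above; `check_model_name` and `build_camembert_embeddings` are hypothetical helpers, and the input and output column names are assumptions.

```python
# Model names copied from the "New CamemBERT Models" list in this release.
CAMEMBERT_MODELS = (
    "camembert_large",
    "camembert_base",
    "camembert_base_ccnet_4gb",
    "camembert_base_ccnet",
    "camembert_base_oscar_4gb",
    "camembert_base_wikipedia_4gb",
)


def check_model_name(name):
    """Return `name` if it is one of the listed models, else raise ValueError."""
    if name not in CAMEMBERT_MODELS:
        raise ValueError(f"unknown CamemBERT model: {name!r}")
    return name


def build_camembert_embeddings(name="camembert_base"):
    """Create a CamemBertEmbeddings stage for French text."""
    # Imported lazily so this module can be loaded without Spark NLP installed.
    from sparknlp.annotator import CamemBertEmbeddings

    return (
        CamemBertEmbeddings.pretrained(check_model_name(name), "fr")
        .setInputCols(["document", "token"])
        .setOutputCol("embeddings")
    )
```

The returned stage expects `document` and `token` columns, so it would typically follow a DocumentAssembler and a Tokenizer in a pipeline.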

New DistilBERT Embeddings Models

| Model | Name | Lang |
|---|---|---|
| DistilBertEmbeddings | distilbert_embeddings_distilbert_base_fr_cased | fr |
| DistilBertEmbeddings | distilbert_embeddings_marathi_distilbert | mr |
| DistilBertEmbeddings | distilbert_embeddings_distilbert_base_indonesian | id |
| DistilBertEmbeddings | distilbert_embeddings_javanese_distilbert_small | jv |
| DistilBertEmbeddings | distilbert_embeddings_malaysian_distilbert_small | ms |
| DistilBertEmbeddings | distilbert_embeddings_distilbert_base_ar_cased | ar |

New ALBERT Embeddings Models

| Model | Name | Lang |
|---|---|---|
| AlbertEmbeddings | albert_embeddings_fralbert_base | fr |
| AlbertEmbeddings | albert_embeddings_albert_base_arabic | ar |
| AlbertEmbeddings | albert_embeddings_marathi_albert_v2 | mr |
| AlbertEmbeddings | albert_embeddings_albert_fa_base_v2 | fa |
| AlbertEmbeddings | albert_embeddings_albert_large_bahasa_cased | ms |
| AlbertEmbeddings | albert_embeddings_marathi_albert | mr |

The complete list of all 5000+ models & pipelines in 200+ languages is available on Models Hub.

New Notebooks

Import CamemBERT models to Spark NLP 🚀

| Spark NLP | HuggingFace Notebooks | Colab |
|---|---|---|
| CamemBertEmbeddings | HuggingFace in Spark NLP - CamemBERT | Open In Colab |

You can visit Import Transformers in Spark NLP for more info.


Documentation


Installation

Python

#PyPI

pip install spark-nlp==3.4.4

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.4

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.4

spark-nlp on Apache Spark 3.2.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.4

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.4

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.4

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.4

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.4

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.4

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.4
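The package coordinates above follow a regular naming pattern, which can be summarized in a small lookup. The sketch below is only an illustration of that pattern: `spark_nlp_package` is a hypothetical helper, not part of Spark NLP, and it covers only the Spark versions listed in this section.

```python
# Maps Spark major.minor to (artifact suffix, Scala version), per the list above.
_SUFFIX_BY_SPARK = {
    "3.0": ("", "2.12"),
    "3.1": ("", "2.12"),
    "3.2": ("-spark32", "2.12"),
    "2.4": ("-spark24", "2.11"),
    "2.3": ("-spark23", "2.11"),
}


def spark_nlp_package(spark_version, gpu=False, release="3.4.4"):
    """Return the `--packages` coordinate for a given Apache Spark version."""
    major_minor = ".".join(spark_version.split(".")[:2])
    if major_minor not in _SUFFIX_BY_SPARK:
        raise ValueError(f"unsupported Spark version: {spark_version}")
    suffix, scala = _SUFFIX_BY_SPARK[major_minor]
    # The "-gpu" marker comes before the Spark-specific suffix.
    artifact = "spark-nlp" + ("-gpu" if gpu else "") + suffix + "_" + scala
    return f"com.johnsnowlabs.nlp:{artifact}:{release}"


if __name__ == "__main__":
    print(spark_nlp_package("3.2.1", gpu=True))
```

The returned string can be passed to `spark-shell --packages` or `pyspark --packages` as shown in the commands above.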

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>3.4.4</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>3.4.4</version>
</dependency>

spark-nlp on Apache Spark 3.2.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark32_2.12</artifactId>
    <version>3.4.4</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark32_2.12</artifactId>
    <version>3.4.4</version>
</dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark24_2.11</artifactId>
    <version>3.4.4</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
    <version>3.4.4</version>
</dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark23_2.11</artifactId>
    <version>3.4.4</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
    <version>3.4.4</version>
</dependency>

FAT JARs

What's Changed

Full Changelog: 3.4.3...3.4.4

New Contributors

@xusliebana @Ahmetemintek @jsl-models @Meryem1425 @mahmoodbayeshi @aymanechilah @DevinTDHa @agsfer @rpranab @C-K-Loan @maziyarpanahi @Damla-Gurbaz @danilojsl @luca-martial @muhammetsnts @josejuanmartinez @bunyamin-polat @galiph @jsl-builder @albertoandreottiATgmail