Spark NLP 3.4.4: New DeBERTa for Token Classification, new CamemBERT embeddings, speed improvements for Tokenizer and UniversalSentenceEncoder annotators, over 160 new state-of-the-art models, and other improvements!
Overview
We are very excited to release Spark NLP 3.4.4! This release comes with a new DeBERTa for Token Classification annotator compatible with existing or fine-tuned models on HuggingFace, a new annotator for CamemBERT embeddings models, up to 18x speed improvements for UniversalSentenceEncoder on GPU devices, up to 400% speed improvements in Tokenizer when a list of exceptions is used, and new state-of-the-art NER, French embeddings, DistilBERT embeddings, and ALBERT embeddings models!
As always, we would like to thank our community for their feedback, questions, and feature requests.
New Features
- NEW: Introducing DeBertaForTokenClassification annotator in Spark NLP. DeBertaForTokenClassification can load DeBERTa v2 & v3 models with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using DebertaV2ForTokenClassification for PyTorch or TFDebertaV2ForTokenClassification for TensorFlow models in HuggingFace #8082
- NEW: Introducing CamemBertEmbeddings annotator in Spark NLP. CamemBERT is a state-of-the-art language model for French based on the RoBERTa architecture pretrained on the French subcorpus of the newly available multilingual corpus OSCAR. For further information or requests, please go to Camembert Website #8237
- Add support for batching rows to improve UniversalSentenceEncoder on GPU devices. This new feature increases GPU speed by 2x to 18x, depending on the distribution of sentences #8234
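The idea behind row batching can be sketched in plain Python. This is only a conceptual illustration and not Spark NLP's actual implementation; the helper name `batch_rows` and the batch size of 2 are hypothetical. Grouping rows means the encoder can be invoked once per batch instead of once per row, which is where a GPU benefits most.

```python
from typing import Iterable, Iterator, List


def batch_rows(rows: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `batch_size` rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[str] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


sentences = ["First sentence.", "Second one.", "A third.", "Fourth.", "Fifth."]
batches = list(batch_rows(sentences, batch_size=2))

# Each batch would be handed to the encoder in a single call.
print([len(b) for b in batches])  # prints [2, 2, 1]
```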
Bug Fixes & Enhancements
- Optimizing Tokenizer performance up to 400% when there is an exceptions list. We have improved the exceptions list to be scalable to a large number of exceptions without impacting the overall performance #7881
- Support latest PySpark releases in Colab, Kaggle, and SageMaker scripts #8028
- Fix bug that caused get input/output/LazyAnnotator to return None #8043
- Fix DeBertaForSequenceClassification in Python failing to load pretrained models #8060
- Fix missing Lemma and POS models from 3.4.3 release
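For reference, a minimal sketch of a Tokenizer configured with an exceptions list, the code path covered by the Tokenizer optimization above. The exception strings, column names, and the helper `build_tokenizer_pipeline` are illustrative assumptions; the imports sit inside the function only so the snippet can be loaded without a Spark installation.

```python
# Hypothetical exceptions: strings the Tokenizer should keep as single tokens
EXCEPTIONS = ["New York", "e-mail", "state-of-the-art"]


def build_tokenizer_pipeline(exceptions):
    """Build a DocumentAssembler -> Tokenizer pipeline with an exceptions list."""
    # Requires pyspark and spark-nlp, plus an active Spark session when called
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    tokenizer = (
        Tokenizer()
        .setInputCols(["document"])
        .setOutputCol("token")
        .setExceptions(exceptions)
    )
    return Pipeline(stages=[document, tokenizer])


# Usage (needs a running Spark NLP session):
#   import sparknlp
#   spark = sparknlp.start()
#   data = spark.createDataFrame([["I sent an e-mail from New York."]]).toDF("text")
#   model = build_tokenizer_pipeline(EXCEPTIONS).fit(data)
#   model.transform(data).select("token.result").show(truncate=False)
```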
Dependencies
- Removing outdated trove4j dependency in favour of native Java modules #8236
- Upgrade the base Apache Spark to 2.4.8, 3.0.3, and 3.2.1
- Upgrade Typesafe Config to 1.4.2
- Upgrade sbt to 1.6.2
Models
Spark NLP 3.4.4 comes with over 160 state-of-the-art multi-lingual pretrained models. Some of the featured models:
New DeBERTa Token Classification Models
New fine-tuned DeBERTa v3 models for token classification over the CoNLL03 and OntoNotes datasets that reach state-of-the-art metrics.
Model | Name | Lang | F1 Dev |
---|---|---|---|
DeBertaForTokenClassification | deberta_v3_large_token_classifier_conll03 | en | 0.97 |
DeBertaForTokenClassification | deberta_v3_base_token_classifier_conll03 | en | 0.96 |
DeBertaForTokenClassification | deberta_v3_small_token_classifier_conll03 | en | 0.95 |
DeBertaForTokenClassification | deberta_v3_xsmall_token_classifier_conll03 | en | 0.93 |
DeBertaForTokenClassification | deberta_v3_large_token_classifier_ontonotes | en | 0.89 |
DeBertaForTokenClassification | deberta_v3_base_token_classifier_ontonotes | en | 0.88 |
DeBertaForTokenClassification | deberta_v3_small_token_classifier_ontonotes | en | 0.87 |
DeBertaForTokenClassification | deberta_v3_xsmall_token_classifier_ontonotes | en | 0.86 |
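As a rough sketch, any of the models listed above could be loaded into a pipeline along these lines. The choice of the xsmall CoNLL03 model, the column names, and the helper `build_ner_pipeline` are assumptions made for illustration; the imports are placed inside the function so the snippet can be loaded without a Spark installation.

```python
# Model name taken from the table above; any other listed name should work the same way
MODEL_NAME = "deberta_v3_xsmall_token_classifier_conll03"


def build_ner_pipeline(model_name=MODEL_NAME, lang="en"):
    """Build a DocumentAssembler -> Tokenizer -> DeBertaForTokenClassification pipeline."""
    # Requires pyspark and spark-nlp, plus an active Spark session when called
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, DeBertaForTokenClassification

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    # Downloads the pretrained model on first use
    classifier = (
        DeBertaForTokenClassification.pretrained(model_name, lang)
        .setInputCols(["document", "token"])
        .setOutputCol("ner")
    )
    return Pipeline(stages=[document, tokenizer, classifier])


# Usage (needs a running Spark NLP session and network access):
#   import sparknlp
#   spark = sparknlp.start()
#   data = spark.createDataFrame([["John Snow Labs is based in Delaware."]]).toDF("text")
#   build_ner_pipeline().fit(data).transform(data).select("ner.result").show(truncate=False)
```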
New CamemBERT Models
Model | Name | Lang |
---|---|---|
CamemBertEmbeddings | camembert_large | fr |
CamemBertEmbeddings | camembert_base | fr |
CamemBertEmbeddings | camembert_base_ccnet_4gb | fr |
CamemBertEmbeddings | camembert_base_ccnet | fr |
CamemBertEmbeddings | camembert_base_oscar_4gb | fr |
CamemBertEmbeddings | camembert_base_wikipedia_4gb | fr |
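A minimal sketch of how one of the CamemBERT models above might be used to produce token embeddings for French text. The column names and the helper `build_camembert_pipeline` are hypothetical, and `camembert_base` is picked only as an example; the imports are placed inside the function so the snippet can be loaded without a Spark installation.

```python
# Model name taken from the table above
CAMEMBERT_MODEL = "camembert_base"


def build_camembert_pipeline(model_name=CAMEMBERT_MODEL, lang="fr"):
    """Build a DocumentAssembler -> Tokenizer -> CamemBertEmbeddings pipeline."""
    # Requires pyspark and spark-nlp, plus an active Spark session when called
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, CamemBertEmbeddings

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    # Downloads the pretrained model on first use
    embeddings = (
        CamemBertEmbeddings.pretrained(model_name, lang)
        .setInputCols(["document", "token"])
        .setOutputCol("embeddings")
    )
    return Pipeline(stages=[document, tokenizer, embeddings])


# Usage (needs a running Spark NLP session and network access):
#   import sparknlp
#   spark = sparknlp.start()
#   data = spark.createDataFrame([["J'aime le camembert."]]).toDF("text")
#   build_camembert_pipeline().fit(data).transform(data).select("embeddings.embeddings").show()
```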
New DistilBERT Embeddings Models
Model | Name | Lang |
---|---|---|
DistilBertEmbeddings | distilbert_embeddings_distilbert_base_fr_cased | fr |
DistilBertEmbeddings | distilbert_embeddings_marathi_distilbert | mr |
DistilBertEmbeddings | distilbert_embeddings_distilbert_base_indonesian | id |
DistilBertEmbeddings | distilbert_embeddings_javanese_distilbert_small | jv |
DistilBertEmbeddings | distilbert_embeddings_malaysian_distilbert_small | ms |
DistilBertEmbeddings | distilbert_embeddings_distilbert_base_ar_cased | ar |
New ALBERT Embeddings Models
Model | Name | Lang |
---|---|---|
AlbertEmbeddings | albert_embeddings_fralbert_base | fr |
AlbertEmbeddings | albert_embeddings_albert_base_arabic | ar |
AlbertEmbeddings | albert_embeddings_marathi_albert_v2 | mr |
AlbertEmbeddings | albert_embeddings_albert_fa_base_v2 | fa |
AlbertEmbeddings | albert_embeddings_albert_large_bahasa_cased | ms |
AlbertEmbeddings | albert_embeddings_marathi_albert | mr |
The complete list of all 5000+ models & pipelines in 200+ languages is available on Models Hub.
New Notebooks
Import CamemBERT models to Spark NLP
Spark NLP | HuggingFace Notebooks | Colab |
---|---|---|
CamemBertEmbeddings | HuggingFace in Spark NLP - CamemBERT |
You can visit Import Transformers in Spark NLP for more info
Documentation
- TF Hub & HuggingFace to Spark NLP
- Models Hub with new models
- Spark NLP documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
- Spark NLP Workshop notebooks
- Spark NLP publications
- Spark NLP in Action
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP Display for visualization of different types of annotations
- Discussions: Engage with other community members, share ideas, and show off how you use Spark NLP!
Installation
Python
#PyPI
pip install spark-nlp==3.4.4
Spark Packages
spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.4
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.4
spark-nlp on Apache Spark 3.2.x (Scala 2.12 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.4
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.4
spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.4
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.4
spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.4
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.4
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.4
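The package coordinates above follow a regular pattern, which can be captured in a small helper when scripting cluster setup. This is a sketch built only from the combinations listed in these notes; the function name `maven_coordinate` and the lookup table are hypothetical and not part of Spark NLP.

```python
VERSION = "3.4.4"

# (artifact suffix, Scala version) per Apache Spark line, as listed in these notes
_BUILDS = {
    "3.0": ("", "2.12"),
    "3.1": ("", "2.12"),
    "3.2": ("-spark32", "2.12"),
    "2.4": ("-spark24", "2.11"),
    "2.3": ("-spark23", "2.11"),
}


def maven_coordinate(spark_version: str, gpu: bool = False) -> str:
    """Return the --packages coordinate for a given Apache Spark version."""
    major_minor = ".".join(spark_version.split(".")[:2])
    if major_minor not in _BUILDS:
        raise ValueError(
            f"No Spark NLP {VERSION} build listed for Spark {spark_version}"
        )
    suffix, scala = _BUILDS[major_minor]
    artifact = "spark-nlp" + ("-gpu" if gpu else "") + suffix
    return f"com.johnsnowlabs.nlp:{artifact}_{scala}:{VERSION}"


print(maven_coordinate("3.2.1"))
# com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.4
print(maven_coordinate("2.4.8", gpu=True))
# com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.4
```

The returned string can be passed to `spark-shell --packages` or `pyspark --packages` as shown above.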
Maven
spark-nlp on Apache Spark 3.0.x and 3.1.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>3.4.4</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>3.4.4</version>
</dependency>
spark-nlp on Apache Spark 3.2.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark32_2.12</artifactId>
<version>3.4.4</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark32_2.12</artifactId>
<version>3.4.4</version>
</dependency>
spark-nlp on Apache Spark 2.4.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark24_2.11</artifactId>
<version>3.4.4</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
<version>3.4.4</version>
</dependency>
spark-nlp on Apache Spark 2.3.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark23_2.11</artifactId>
<version>3.4.4</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
<version>3.4.4</version>
</dependency>
FAT JARs
- CPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.4.4.jar
- GPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.4.4.jar
- CPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark32-assembly-3.4.4.jar
- GPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark32-assembly-3.4.4.jar
- CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.4.4.jar
- GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.4.4.jar
- CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.4.4.jar
- GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.4.4.jar
What's Changed
Full Changelog: 3.4.3...3.4.4
New Contributors
- @aymanechilah made their first contribution in #6956
@xusliebana @Ahmetemintek @jsl-models @Meryem1425 @mahmoodbayeshi @aymanechilah @DevinTDHa @agsfer @rpranab @C-K-Loan @maziyarpanahi @Damla-Gurbaz @danilojsl @luca-martial @muhammetsnts @josejuanmartinez @bunyamin-polat @galiph @jsl-builder @albertoandreottiATgmail