John Snow Labs Spark-NLP 3.3.3: New DistilBERT for Sequence Classification, new trainable and distributed Doc2Vec, BERT improvements on GPU, new state-of-the-art DistilBERT models for topic and sentiment detection, enhancements, and bug fixes!
Overview
(knock, knock, knock) Penny? Yes, this is a very special release if you are obsessed with the number 3
as much as we are! So we are pleased to announce the Spark NLP 3.3.3 release!
This release comes with a new DistilBertForSequenceClassification annotator for existing or fine-tuned DistilBERT text classification models on HuggingFace, a new distributed and trainable Doc2Vec annotator based on the Word2Vec implementation in Spark ML, improvements to BertEmbeddings and BertSentenceEmbeddings on a single machine with a GPU device when the DataFrame has one sentence per row or the input column is set to document, new state-of-the-art fine-tuned DistilBERT models for sequence classification, enhancements, bug fixes, and more!
As always, we would like to thank our community for their feedback, questions, and feature requests.
New Features and Enhancements
- NEW: Introducing DistilBertForSequenceClassification annotator in Spark NLP. DistilBertForSequenceClassification can load DistilBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using DistilBertForSequenceClassification or TFDistilBertForSequenceClassification in HuggingFace 🤗
- NEW: Introducing trainable and distributed Doc2Vec annotators based on Word2Vec in Spark ML
- Improve BertEmbeddings on a single machine with a GPU device when the DataFrame has a single document/sentence per row
- Improve BertSentenceEmbeddings on a single machine with a GPU device when the DataFrame has a single document/sentence per row
- Add a new feature to the CoNLL() class, allowing it to read multiple CoNLL files at the same time into a single DataFrame
- Add support for Long type in label column for ClassifierDLApproach and SentimentDLApproach
- Add a script to set up AWS SageMaker, thanks to @xegulon
- Add instructions to set up Amazon Linux 2
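As a rough illustration of how the new DistilBertForSequenceClassification annotator might be wired into a pipeline, the sketch below pairs it with a DocumentAssembler and a Tokenizer. It assumes the Python packages `spark-nlp` 3.3.3 and `pyspark` are installed. The helper name `build_classifier_pipeline`, the column names, and the choice of `distilbert_base_sequence_classifier_imdb` (one of the models listed in this release) are illustrative assumptions, not something the release prescribes. Imports are kept inside the function so the definition loads even where Spark NLP is not installed.

```python
# Illustrative default: one of the pretrained models listed in this release.
DEFAULT_MODEL = "distilbert_base_sequence_classifier_imdb"


def build_classifier_pipeline(model_name=DEFAULT_MODEL, lang="en"):
    """Return an unfitted Spark ML Pipeline ending in a DistilBERT classifier.

    Hypothetical helper for illustration only. It needs spark-nlp and pyspark,
    and a running Spark session, because pretrained() downloads the model.
    """
    # Imported lazily: these require a Spark NLP installation.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, DistilBertForSequenceClassification

    # Turn the raw "text" column into document annotations.
    document_assembler = (
        DocumentAssembler().setInputCol("text").setOutputCol("document")
    )
    # Split each document into tokens.
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    # Classify each document with the pretrained DistilBERT model.
    classifier = (
        DistilBertForSequenceClassification.pretrained(model_name, lang)
        .setInputCols(["document", "token"])
        .setOutputCol("class")
    )
    return Pipeline(stages=[document_assembler, tokenizer, classifier])
```

With a session started through `sparknlp.start()`, calling `build_classifier_pipeline().fit(df).transform(df)` on a DataFrame `df` that has a `text` column should add a `class` column of annotations whose `result` field holds the predicted label.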
Bug Fixes
- Improve model and pipeline resolution in Spark NLP for cases where wrong models/pipelines were downloaded regardless of their Apache Spark version
- Fix MarianTransformer bug on empty sequences
- Fix TFInvalidArgumentException in MarianTransformer for sequences longer than 512
- Fix MarianTransformer multi-lingual models and pipelines such as opus_mt_mul_en
- Fix a bug in DateMatcher and MultiDateMatcher where months were detected from subwords by mistake
- Add the missing lemma_antbnc model to Models Hub
- Add the missing sentiment_vivekn model to Models Hub
- Add the missing spellcheck_norvig model to Models Hub
Models
New state-of-the-art fine-tuned DistilBERT models for Sequence Classification:
Featured Pretrained Models
Model | Name | Build | Lang |
---|---|---|---|
DistilBertForSequenceClassification | distilbert_sequence_classifier_sst2 | 3.3.3 | en |
DistilBertForSequenceClassification | distilbert_sequence_classifier_policy | 3.3.3 | en |
DistilBertForSequenceClassification | distilbert_sequence_classifier_industry | 3.3.3 | en |
DistilBertForSequenceClassification | distilbert_sequence_classifier_emotion | 3.3.3 | en |
DistilBertForSequenceClassification | distilbert_sequence_classifier_banking77 | 3.3.3 | en |
DistilBertForSequenceClassification | distilbert_multilingual_sequence_classifier_allocine | 3.3.3 | fr |
DistilBertForSequenceClassification | distilbert_base_sequence_classifier_imdb | 3.3.3 | ur |
DistilBertForSequenceClassification | distilbert_base_sequence_classifier_imdb | 3.3.3 | en |
DistilBertForSequenceClassification | distilbert_base_sequence_classifier_amazon_polarity | 3.3.3 | en |
DistilBertForSequenceClassification | distilbert_base_sequence_classifier_ag_news | 3.3.3 | en |
Doc2VecModel | doc2vec_gigaword_300 | 3.3.3 | en |
Doc2VecModel | doc2vec_gigaword_wiki_300 | 3.3.3 | en |
The complete list of all 4000+ models & pipelines in 200+ languages is available on Models Hub.
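The pretrained Doc2Vec models in the table can be loaded by name in much the same way as other pretrained annotators. The following is a minimal sketch, assuming `spark-nlp` 3.3.3 and `pyspark` are installed and a Spark session is running. The helper name `build_doc2vec_pipeline` and the column names are hypothetical choices made for illustration.

```python
# Illustrative default: one of the Doc2Vec models listed in the table above.
DOC2VEC_MODEL = "doc2vec_gigaword_300"


def build_doc2vec_pipeline(model_name=DOC2VEC_MODEL, lang="en"):
    """Return an unfitted Pipeline that embeds documents with Doc2VecModel.

    Hypothetical helper for illustration only; requires spark-nlp and pyspark.
    """
    # Imported lazily: these require a Spark NLP installation.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, Doc2VecModel

    document_assembler = (
        DocumentAssembler().setInputCol("text").setOutputCol("document")
    )
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    # Doc2VecModel consumes tokens and emits one embedding per document.
    doc2vec = (
        Doc2VecModel.pretrained(model_name, lang)
        .setInputCols(["token"])
        .setOutputCol("sentence_embeddings")
    )
    return Pipeline(stages=[document_assembler, tokenizer, doc2vec])
```

To train embeddings on your own corpus instead of loading a pretrained model, the trainable Doc2VecApproach annotator takes the place of `Doc2VecModel.pretrained(...)` in the last stage.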
New Notebooks
Spark NLP | Notebooks |
---|---|
DistilBertForSequenceClassification | HuggingFace in Spark NLP - DistilBertForSequenceClassification |
Doc2Vec | Train Doc2Vec for Text Classification |
Documentation
- TF Hub & HuggingFace to Spark NLP
- Models Hub with new models
- Spark NLP documentation
- Spark NLP Scala APIs
- Spark NLP Python APIs
- Spark NLP Workshop notebooks
- Spark NLP publications
- Spark NLP in Action
- Spark NLP training certification notebooks for Google Colab and Databricks
- Spark NLP Display for visualization of different types of annotations
- Discussions: engage with other community members, share ideas, and show off how you use Spark NLP!
Installation
Python
#PyPI
pip install spark-nlp==3.3.3
Spark Packages
spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.3
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.3
spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.3
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.3
spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.3
GPU
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.3
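The package coordinates above follow a regular pattern, so the right one can be derived from the Apache Spark version and whether a GPU build is wanted. The sketch below encodes only the combinations listed in these notes. The function name `spark_nlp_coordinate` is a hypothetical helper, not part of Spark NLP.

```python
SPARK_NLP_VERSION = "3.3.3"


def spark_nlp_coordinate(spark_version, gpu=False):
    """Return the --packages coordinate for a given Apache Spark version.

    Hypothetical helper: it covers only the Spark versions listed in these
    release notes (3.0.x, 3.1.x, 2.4.x, 2.3.x) and raises for anything else.
    """
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    if major == 3 and minor in (0, 1):
        suffix, scala = "", "2.12"          # Spark 3.0.x / 3.1.x, Scala 2.12
    elif (major, minor) == (2, 4):
        suffix, scala = "-spark24", "2.11"  # Spark 2.4.x, Scala 2.11
    elif (major, minor) == (2, 3):
        suffix, scala = "-spark23", "2.11"  # Spark 2.3.x, Scala 2.11
    else:
        raise ValueError("Unsupported Apache Spark version: " + spark_version)
    # The "-gpu" marker comes before the Spark-version suffix.
    artifact = "spark-nlp" + ("-gpu" if gpu else "") + suffix
    return "com.johnsnowlabs.nlp:{}_{}:{}".format(artifact, scala, SPARK_NLP_VERSION)
```

For example, `spark_nlp_coordinate("2.4.8", gpu=True)` yields the same coordinate as the GPU line listed for Apache Spark 2.4.x, which can then be passed to `spark-shell --packages` or `pyspark --packages`.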
Maven
spark-nlp on Apache Spark 3.0.x and 3.1.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.12</artifactId>
<version>3.3.3</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu_2.12</artifactId>
<version>3.3.3</version>
</dependency>
spark-nlp on Apache Spark 2.4.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark24_2.11</artifactId>
<version>3.3.3</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
<version>3.3.3</version>
</dependency>
spark-nlp on Apache Spark 2.3.x:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-spark23_2.11</artifactId>
<version>3.3.3</version>
</dependency>
spark-nlp-gpu:
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
<version>3.3.3</version>
</dependency>
FAT JARs
- CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.3.3.jar
- GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.3.3.jar
- CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.3.3.jar
- GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.3.3.jar
- CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.3.3.jar
- GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.3.3.jar
What's Changed
Full Changelog: 3.3.2...3.3.3
New Contributors
@DevinTDHa @diatrambitas @xegulon @egenc @gadde5300 @jsl-models @murat-gunay @josejuanmartinez @maziyarpanahi @jsl-builder @wolliq @xusliebana @agsfer @danilojsl @vankov @muhammetsnts @albertoandreottiATgmail