
John Snow Labs Spark-NLP 3.3.3: New DistilBERT for Sequence Classification, new trainable and distributed Doc2Vec, BERT improvements on GPU, new state-of-the-art DistilBERT models for topic and sentiment detection, enhancements, and bug fixes!

@maziyarpanahi maziyarpanahi released this 22 Nov 18:37
· 2384 commits to master since this release

Overview

(knock, knock, knock) Penny? Yes, this is a very special release if you are as obsessed with the number 3 as we are! We are pleased to announce the Spark NLP 🚀 3.3.3 release! 🎉 🎊 🎈

This release comes with a new DistilBertForSequenceClassification annotator for existing or fine-tuned DistilBERT text classification models on HuggingFace, a new distributed and trainable Doc2Vec annotator based on the Word2Vec implementation in Spark ML, improvements to BertEmbeddings and BertSentenceEmbeddings on a single machine with a GPU device when the DataFrame has one sentence per row or the input column is set to document, new state-of-the-art fine-tuned DistilBERT models for sequence classification, enhancements, bug fixes, and more!

As always, we would like to thank our community for their feedback, questions, and feature requests.


New Features and Enhancements

  • NEW: Introducing the DistilBertForSequenceClassification annotator in Spark NLP 🚀. DistilBertForSequenceClassification can load DistilBERT models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks. This annotator is compatible with all models trained or fine-tuned using DistilBertForSequenceClassification or TFDistilBertForSequenceClassification in HuggingFace 🤗
  • NEW: Introducing trainable and distributed Doc2Vec annotators based on Word2Vec in Spark ML
  • Improve BertEmbeddings for DataFrames with a single document/sentence per row on a single machine with a GPU device
  • Improve BertSentenceEmbeddings for DataFrames with a single document/sentence per row on a single machine with a GPU device
  • Add a new feature to the CoNLL() class, allowing it to read multiple CoNLL files at the same time into a single DataFrame
  • Add support for Long type in label column for ClassifierDLApproach and SentimentDLApproach
  • Add a script to set up AWS SageMaker, thanks to @xegulon
  • Add instructions to set up Amazon Linux 2
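
As a rough illustration of the trainable Doc2Vec annotator mentioned above, here is a minimal Python sketch of a training pipeline. The function name `build_doc2vec_pipeline`, the column names, and the parameter values are hypothetical choices for this example and are not taken from the release; it assumes `spark-nlp` and `pyspark` are installed, and the imports are deferred into the function so the definition itself does not need a Spark installation.

```python
def build_doc2vec_pipeline(vector_size=100, min_count=1):
    """Build an unfitted pipeline: raw text -> tokens -> Doc2Vec embeddings.

    The defaults are illustrative values, not recommendations.
    """
    # Deferred imports: these are only needed when the pipeline is built.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, Doc2VecApproach

    # Turn the raw "text" column into Spark NLP document annotations.
    document = DocumentAssembler().setInputCol("text").setOutputCol("document")

    # Doc2VecApproach consumes token annotations.
    token = Tokenizer().setInputCols(["document"]).setOutputCol("token")

    # The trainable annotator; fitting happens when Pipeline.fit is called.
    doc2vec = (
        Doc2VecApproach()
        .setInputCols(["token"])
        .setOutputCol("sentence_embeddings")
        .setVectorSize(vector_size)
        .setMinCount(min_count)
    )

    return Pipeline(stages=[document, token, doc2vec])


# Usage, assuming a session from sparknlp.start() and a DataFrame `df`
# with a string column named "text":
#   model = build_doc2vec_pipeline().fit(df)
#   model.transform(df).select("sentence_embeddings").show()
```

Because training is delegated to Spark, the same pipeline definition should work unchanged on a single machine or a cluster; only the session configuration differs.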

Bug Fixes

  • Improve model and pipeline resolution in Spark NLP for cases where the wrong models/pipelines were downloaded regardless of the Apache Spark version
  • Fix MarianTransformer bug on empty sequences
  • Fix TFInvalidArgumentException in MarianTransformer for sequences longer than 512
  • Fix MarianTransformer multi-lingual models and pipelines such as opus_mt_mul_en and opus_mt_mul_en
  • Fix a bug in DateMatcher and MultiDateMatcher where a month was mistakenly detected from subwords
  • Add the missing lemma_antbnc model to Models Hub
  • Add the missing sentiment_vivekn model to Models Hub
  • Add the missing spellcheck_norvig model to Models Hub

Models

New state-of-the-art fine-tuned DistilBERT models for Sequence Classification:

Featured Pretrained Models

Model Name Lang Build
DistilBertForSequenceClassification distilbert_sequence_classifier_sst2 en 3.3.3
DistilBertForSequenceClassification distilbert_sequence_classifier_policy en 3.3.3
DistilBertForSequenceClassification distilbert_sequence_classifier_industry en 3.3.3
DistilBertForSequenceClassification distilbert_sequence_classifier_emotion en 3.3.3
DistilBertForSequenceClassification distilbert_sequence_classifier_banking77 en 3.3.3
DistilBertForSequenceClassification distilbert_multilingual_sequence_classifier_allocine fr 3.3.3
DistilBertForSequenceClassification distilbert_base_sequence_classifier_imdb ur 3.3.3
DistilBertForSequenceClassification distilbert_base_sequence_classifier_imdb en 3.3.3
DistilBertForSequenceClassification distilbert_base_sequence_classifier_amazon_polarity en 3.3.3
DistilBertForSequenceClassification distilbert_base_sequence_classifier_ag_news en 3.3.3
Doc2VecModel doc2vec_gigaword_300 en 3.3.3
Doc2VecModel doc2vec_gigaword_wiki_300 en 3.3.3

The complete list of all 4000+ models & pipelines in 200+ languages is available on Models Hub.
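
The pretrained DistilBERT classifiers listed above can be loaded by name. The following is a minimal Python sketch under a few assumptions: the function name `build_sequence_classifier` and the column names are hypothetical, the default model is just one entry picked from the list, and `spark-nlp` and `pyspark` must be installed with network access for the model download. Imports are deferred into the function so that defining it does not require Spark.

```python
def build_sequence_classifier(
    name="distilbert_base_sequence_classifier_imdb", lang="en"
):
    """Build a pipeline that classifies a "text" column with a pretrained model."""
    # Deferred imports: these are only needed when the pipeline is built.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, DistilBertForSequenceClassification

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    token = Tokenizer().setInputCols(["document"]).setOutputCol("token")

    # Downloads the model on first use; requires an active Spark session.
    classifier = (
        DistilBertForSequenceClassification.pretrained(name, lang)
        .setInputCols(["document", "token"])
        .setOutputCol("class")
    )

    return Pipeline(stages=[document, token, classifier])


# Usage, assuming `spark = sparknlp.start()` has been called:
#   df = spark.createDataFrame([["I really liked that movie!"]]).toDF("text")
#   result = build_sequence_classifier().fit(df).transform(df)
#   result.select("class.result").show(truncate=False)
```

Swapping in another entry from the list should only require changing `name` and `lang`, for example `distilbert_multilingual_sequence_classifier_allocine` with `fr`.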

New Notebooks

Spark NLP: Notebook
DistilBertForSequenceClassification: HuggingFace in Spark NLP - DistilBertForSequenceClassification
Doc2Vec: Train Doc2Vec for Text Classification

Documentation


Installation

Python

#PyPI

pip install spark-nlp==3.3.3

Spark Packages

spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.3

spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.3

spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.3

GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.3

pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.3
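
The package coordinates in this release follow a regular pattern: an optional `-gpu` part, an optional `-spark24` or `-spark23` part, and a Scala version suffix. As a convenience, the small Python helper below builds the coordinate for a given Apache Spark version. It is only a sketch: `spark_nlp_coordinate` is a hypothetical name, not part of Spark NLP, and the mapping covers only the Spark versions named in these notes.

```python
def spark_nlp_coordinate(spark_version: str, gpu: bool = False,
                         release: str = "3.3.3") -> str:
    """Return the --packages / Maven coordinate for an Apache Spark version."""
    # Only the major.minor part decides which artifact to use.
    major_minor = ".".join(spark_version.split(".")[:2])

    if major_minor in ("3.0", "3.1"):
        spark_part, scala = "", "2.12"
    elif major_minor == "2.4":
        spark_part, scala = "-spark24", "2.11"
    elif major_minor == "2.3":
        spark_part, scala = "-spark23", "2.11"
    else:
        raise ValueError(f"Unsupported Apache Spark version: {spark_version}")

    gpu_part = "-gpu" if gpu else ""
    return f"com.johnsnowlabs.nlp:spark-nlp{gpu_part}{spark_part}_{scala}:{release}"


print(spark_nlp_coordinate("2.4.8", gpu=True))
# com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.3
```

The returned string can be passed to `spark-shell --packages` or `pyspark --packages` as shown in the commands above.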

Maven

spark-nlp on Apache Spark 3.0.x and 3.1.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>3.3.3</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.12</artifactId>
    <version>3.3.3</version>
</dependency>

spark-nlp on Apache Spark 2.4.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark24_2.11</artifactId>
    <version>3.3.3</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
    <version>3.3.3</version>
</dependency>

spark-nlp on Apache Spark 2.3.x:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark23_2.11</artifactId>
    <version>3.3.3</version>
</dependency>

spark-nlp-gpu:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
    <version>3.3.3</version>
</dependency>

FAT JARs

What's Changed

Full Changelog: 3.3.2...3.3.3

New Contributors

@DevinTDHa @diatrambitas @xegulon @egenc @gadde5300 @jsl-models @murat-gunay @josejuanmartinez @maziyarpanahi @jsl-builder @wolliq @xusliebana @agsfer @danilojsl @vankov @muhammetsnts @albertoandreottiATgmail