This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Release 1.0.3

Scott Graham edited this page Jan 29, 2020 · 1 revision

MASC Release 1.0.3

Version 1.0.3 of Microsoft MASC, an Apache Spark connector for Apache Accumulo, has been released. MASC integrates Apache Spark and Apache Accumulo, combining the rich Spark machine learning ecosystem with the scalable and secure data storage capabilities of Accumulo. This work is publicly available under the Apache License 2.0 on GitHub at https://github.com/microsoft/masc. Feedback, questions, and contributions are welcome.

Usage

A PySpark-based example is available in the Accumulo-Spark Connector Demo Notebook.

Connector documentation: https://github.com/microsoft/masc/blob/master/connector/README.md

JARs available on Maven Central Repository:

Major Features

  • Simplified Spark DataFrame read/write to Accumulo using DataSource v2 API
  • Speedup of 2-5x over existing approaches for pulling key-value data into DataFrame format
  • Scala and Python support without overhead for moving between languages
  • Process streaming data from Accumulo without loading it all into Spark memory
  • Push-down filtering with a flexible expression language (JUEL): users can apply logical operators and comparisons to reduce the amount of data returned from Accumulo
  • Column pruning based on selected fields transparently reduces the amount of data returned from Accumulo
  • Server-side inference: ML model inference runs on the Accumulo nodes using MLeap, which increases the scalability of AI solutions and keeps the data in Accumulo
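
To make the push-down filtering and column pruning features above more concrete, here is a minimal conceptual sketch in plain Python. It is not MASC's implementation and does not use Spark, Accumulo, or JUEL: the in-memory `ENTRIES` list stands in for a sorted key-value table, and `scan_rows`, `columns`, and `predicate` are hypothetical names chosen for illustration. The idea it shows is that the filter and the column selection are applied while scanning, so only matching rows and requested fields are ever handed back to the caller.

```python
from itertools import groupby
from operator import itemgetter
from typing import Callable, Dict, Iterable, Iterator, Optional, Set, Tuple

# Hypothetical stand-in for a sorted key-value table: (row id, column, value).
Entry = Tuple[str, str, str]
Row = Dict[str, str]


def scan_rows(
    entries: Iterable[Entry],
    columns: Optional[Set[str]] = None,
    predicate: Optional[Callable[[Row], bool]] = None,
) -> Iterator[Row]:
    """Lazily assemble rows from sorted entries, filtering and pruning during the scan."""
    # Entries are assumed to be sorted by row id, so consecutive entries form one row.
    for row_id, group in groupby(entries, key=itemgetter(0)):
        full_row = {column: value for _, column, value in group}
        # "Push-down" filter: evaluated on the full row before anything is returned,
        # so the predicate may reference columns that are later pruned away.
        if predicate is not None and not predicate(full_row):
            continue
        # Column pruning: keep only the requested fields (all fields if none requested).
        row = {"id": row_id}
        for column, value in full_row.items():
            if columns is None or column in columns:
                row[column] = value
        yield row


ENTRIES = [
    ("r1", "label", "spam"),
    ("r1", "score", "0.91"),
    ("r1", "text", "win a prize"),
    ("r2", "label", "ham"),
    ("r2", "score", "0.12"),
    ("r2", "text", "see you at noon"),
]

# Keep only rows with score > 0.5 and return only the "label" field.
result = list(
    scan_rows(ENTRIES, columns={"label"}, predicate=lambda r: float(r["score"]) > 0.5)
)
print(result)
```

Because `scan_rows` is a generator, rows are produced one at a time rather than materialized up front, which loosely mirrors the streaming feature in the list above. In the real connector the equivalent work happens on the Accumulo side; see the connector documentation for the actual options.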

Known Issues

  • [37] Support SaveMode when writing DataFrames

Contributions

Thanks to the contributors from the Azure Government Customer Engineering and Azure Government teams: Markus Cozowicz, Scott Graham, Jun-Ki Min, Chenhui Hu, Arvind Shyamsundar, Marc Parisi, Billie Rinaldi, Anupam Sharma, Tao Wu, and Pavandeep Kalra.
