Integration of nanoarrow and armadillo for use with mlpack

eddelbuettel/naarma

naarma: nanoarrow to armadillo integration

Motivation

mlpack provides numerous machine learning algorithms in compact, portable C++ code that is easy to use and to deploy. Getting data into mlpack can be straightforward, for example from csv files. These days, data can live in many places and formats, and many excellent data readers are available: duckdb, polars, and many more. These generally provide arrow (columnar) binary representation interfaces, which makes the choice of data reader less relevant: the interface is what matters.

Here we examine the neat nanoarrow package and its use of the C data interface (portable, lightweight, no need for libarrow). In particular, we look at growing matrices (and vectors) from the 'streamed' representation available from (nano)arrow.
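To make the idea of 'growing' a matrix from a stream concrete, here is a minimal sketch in plain C++. It is not the code used in this package: `Batch` and `GrowingMatrix` are hypothetical names, and a batch is modelled as plain vectors of double rather than as actual Arrow arrays. It only illustrates that batches of unknown total length can be appended column by column and then flattened into one column-major buffer, which is the memory layout armadillo uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one batch pulled from an array stream:
// columns[j] holds the values of column j for the rows in this batch.
struct Batch {
    std::vector<std::vector<double>> columns;
};

// Hypothetical helper: keeps one growing vector per column so that
// batches can be appended without knowing the total row count up front.
class GrowingMatrix {
public:
    explicit GrowingMatrix(std::size_t n_cols) : cols_(n_cols) {}

    // Append one batch; it must have the expected number of columns,
    // and all of its columns must have the same length.
    void append(const Batch& b) {
        assert(b.columns.size() == cols_.size());
        for (std::size_t j = 0; j < cols_.size(); ++j) {
            assert(b.columns[j].size() == b.columns[0].size());
            cols_[j].insert(cols_[j].end(),
                            b.columns[j].begin(), b.columns[j].end());
        }
    }

    std::size_t n_rows() const { return cols_.empty() ? 0 : cols_[0].size(); }
    std::size_t n_cols() const { return cols_.size(); }

    // Flatten into one contiguous column-major buffer:
    // all of column 0, then all of column 1, and so on.
    std::vector<double> to_column_major() const {
        std::vector<double> out;
        out.reserve(n_rows() * n_cols());
        for (const auto& c : cols_) {
            out.insert(out.end(), c.begin(), c.end());
        }
        return out;
    }

private:
    std::vector<std::vector<double>> cols_;
};
```

A buffer laid out this way could be handed to an armadillo matrix together with the row and column counts, as armadillo offers a constructor that takes a pointer to existing memory. Whether to copy at each batch or once at the end is a design choice this sketch does not try to settle.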

Our first example reads the data for mlpack's introductory random forest example directly from the compressed files on the website into duckdb (using its standard in-memory representation) and exports it as nanoarrow array streams (also in memory), which are then converted to the armadillo matrices that mlpack uses. (We note that while mlpack is heavily templated, the standard representation is still double, which simplifies the interface; extensions are possible and planned.)

But more importantly:

  • no data is ever manifested on disk, so the example can live 'on the edge' in pure compute nodes
  • while driven from R (because that is what I like), all the data sits in Arrow types, allowing a fuller vocabulary of types where needed (e.g. uint16_t or any type other than the default signed integer, ditto for float)
  • this should be easily extensible to 'Arrow over RPC'

A second example redoes this from polars; a third example directly from arrow.

A fourth example shows how to load the NYC 'flights' dataset, demonstrating that fuller data frame objects can be loaded too. As we add a simple 'turn to factor levels' converter, care must of course be taken when interpreting these levels as double variables.
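The caveat about factor levels is easier to see with a small sketch. The following is an illustration only, not the converter in this package: `Factor` and `to_factor` are hypothetical names, and the choice of sorted unique labels with 1-based codes (sorted by plain string comparison) is an assumption made for the example.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical result type: the distinct labels plus one code per observation.
struct Factor {
    std::vector<std::string> levels;  // sorted unique labels
    std::vector<double> codes;        // 1-based index into levels, stored as double
};

// Hypothetical converter: map each string to the position of its label
// among the sorted unique labels (assumption: plain std::string ordering).
Factor to_factor(const std::vector<std::string>& values) {
    Factor f;
    f.levels = values;
    std::sort(f.levels.begin(), f.levels.end());
    f.levels.erase(std::unique(f.levels.begin(), f.levels.end()),
                   f.levels.end());

    f.codes.reserve(values.size());
    for (const auto& v : values) {
        // Every value is present in levels, so lower_bound finds it exactly.
        auto it = std::lower_bound(f.levels.begin(), f.levels.end(), v);
        f.codes.push_back(static_cast<double>(it - f.levels.begin()) + 1.0);
    }
    return f;
}
```

With the input "UA", "AA", "UA", "DL" the levels are AA, DL, UA and the codes are 3, 1, 3, 2. These numbers are labels, not magnitudes: a code of 3 is not 'more' than a code of 1, so any algorithm that treats the column as a continuous double will read an ordering into it that the data does not contain.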

Installation

The code is provided as an R package, so a standard installation from the repository via

remotes::install_github("eddelbuettel/naarma")

works. The only dependencies needed to install the package are three other standard packages, namely Rcpp, RcppArmadillo, and nanoarrow.

Moreover, the package can be installed (as binary, where available, or source) from its r-universe repository via

urls <- c("https://eddelbuettel.r-universe.dev", "https://cloud.r-project.org")
install.packages('naarma', repos = urls)

and is also available as an Ubuntu binary (see the docs for that).

Author

Dirk Eddelbuettel

License

GPL (>= 2)
