mlpack provides numerous machine learning algorithms in easy-to-use, compact, and portable C++ code that is easy to deploy. Getting data into mlpack can be straightforward, for example from CSV files. These days, however, data can live in many places and formats, and many excellent data readers are available: duckdb, polars, and others generally provide Arrow (columnar) binary representation interfaces. This makes the choice of data reader less relevant, as the interface is what matters.
Here we examine the neat nanoarrow package and its use of the (portable, lightweight, no need for `libarrow`) C data interface. In particular, we look at growing matrices (and vectors) from the 'streamed' representation available from (nano)arrow.
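
To make 'growing a vector from a stream' concrete, here is a minimal sketch using only the standard library. The `ArrowArray` struct follows the layout published in the Arrow C data interface specification; the helper `append_float64_chunk` is a hypothetical name for illustration and not a function of nanoarrow or of this package. It assumes every chunk is a plain float64 array and maps null entries to NaN.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

#ifndef ARROW_C_DATA_INTERFACE
#define ARROW_C_DATA_INTERFACE
// Struct layout as given in the Arrow C data interface specification.
struct ArrowArray {
  std::int64_t length;
  std::int64_t null_count;
  std::int64_t offset;
  std::int64_t n_buffers;
  std::int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};
#endif

// Hypothetical helper: append one streamed chunk of a float64 column to a
// growing vector. Assumption: the chunk really is float64, so buffers[0] is
// the optional validity bitmap and buffers[1] holds the values.
inline void append_float64_chunk(std::vector<double>& out, const ArrowArray& chunk) {
  assert(chunk.n_buffers == 2);
  const auto* validity = static_cast<const std::uint8_t*>(chunk.buffers[0]);
  const auto* values = static_cast<const double*>(chunk.buffers[1]);
  for (std::int64_t i = 0; i < chunk.length; ++i) {
    const std::int64_t j = chunk.offset + i;  // logical index into the buffers
    // Validity bits are packed least-significant bit first; no bitmap means no nulls.
    const bool valid = validity == nullptr || ((validity[j / 8] >> (j % 8)) & 1);
    out.push_back(valid ? values[j] : std::numeric_limits<double>::quiet_NaN());
  }
}
```

Calling the helper once per chunk pulled from the stream grows the vector with amortized reallocation. A matrix can be built the same way, one such vector per column, and packed into column-major storage once the stream is exhausted.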
Our first example reads the mlpack data for its introductory random forest example directly off the compressed files on the website into duckdb (using the standard in-memory representation), exports it to [nanoarrow][nanarrow] array streams (also in memory), and then converts these to the [armadillo][armadillo] matrices which mlpack uses. (We note that while mlpack is heavily templated, the standard representation is still `double`, which simplifies the interface; extensions are possible/planned.)
But more importantly:

- no data is ever manifested on disk, so the example can live 'on the edge' in pure compute nodes
- while driven from R (because that is what I like), all the data sits in Arrow types, allowing a fuller vocabulary of types where needed (i.e. `uint16_t` or any type other than the default signed integer, ditto for `float`)
- this should be easily extensible to 'Arrow over RPC'
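
The point about the fuller type vocabulary can be illustrated with a small sketch of how typed Arrow buffers might be widened to `double`. The one-letter format codes (`S` for uint16, `f` for float32, `g` for float64, and so on) come from the Arrow C data interface specification. The functions `widen` and `to_double` are hypothetical names and not this package's API, and the sketch assumes the buffer contains no nulls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Copy n values of type T into a vector of double.
template <typename T>
std::vector<double> widen(const void* data, std::size_t n) {
  assert(data != nullptr || n == 0);
  const T* values = static_cast<const T*>(data);
  return std::vector<double>(values, values + n);
}

// Hypothetical dispatcher: choose the element type from the Arrow format
// string of a primitive column. Anything else (strings, booleans, nested
// types, ...) is rejected rather than guessed at.
inline std::vector<double> to_double(const std::string& format, const void* data,
                                     std::size_t n) {
  if (format.size() == 1) {
    switch (format[0]) {
      case 'c': return widen<std::int8_t>(data, n);
      case 'C': return widen<std::uint8_t>(data, n);
      case 's': return widen<std::int16_t>(data, n);
      case 'S': return widen<std::uint16_t>(data, n);
      case 'i': return widen<std::int32_t>(data, n);
      case 'I': return widen<std::uint32_t>(data, n);
      case 'l': return widen<std::int64_t>(data, n);   // exact only up to 2^53
      case 'L': return widen<std::uint64_t>(data, n);  // exact only up to 2^53
      case 'f': return widen<float>(data, n);
      case 'g': return widen<double>(data, n);
      default: break;
    }
  }
  throw std::invalid_argument("unsupported Arrow format: " + format);
}
```

Keeping the source type in Arrow until this last step means the widening happens exactly once, at the boundary to the matrix, instead of at every hop between reader and consumer.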
A second example redoes this from polars; a third example directly from arrow.
A fourth example shows how to load the NYC 'flights' dataset, demonstrating that fuller data frame objects can be loaded too (as we add a simple 'turn to factor levels' converter, care must of course be taken when interpreting these levels as `double` variables).
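
As an illustration of what a 'turn to factor levels' conversion can look like, here is a minimal sketch. It is not the converter shipped in the package: the names `FactorColumn` and `to_factor` are hypothetical, and the choices of sorted levels (byte-wise, not locale-aware) and 1-based codes are assumptions modelled on R's `factor()`.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct FactorColumn {
  std::vector<std::string> levels;  // distinct labels, sorted byte-wise
  std::vector<double> codes;        // 1-based index into levels, stored as double
};

// Hypothetical converter: replace each string by the position of its level.
inline FactorColumn to_factor(const std::vector<std::string>& values) {
  FactorColumn f;
  f.levels = values;
  std::sort(f.levels.begin(), f.levels.end());
  f.levels.erase(std::unique(f.levels.begin(), f.levels.end()), f.levels.end());
  f.codes.reserve(values.size());
  for (const std::string& v : values) {
    auto it = std::lower_bound(f.levels.begin(), f.levels.end(), v);
    assert(it != f.levels.end() && *it == v);  // every value is one of the levels
    f.codes.push_back(static_cast<double>(it - f.levels.begin()) + 1.0);
  }
  return f;
}
```

The resulting codes are labels, not magnitudes: level 3 is not 'more' than level 1, so a model that treats the column as a continuous `double` feature reads an ordering into it that the data does not contain.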
The code is provided as an R package, so a standard installation from the repository via

    remotes::install_github("eddelbuettel/naarma")

should work. The only dependencies needed to install the package are three other standard packages, namely `Rcpp`, `RcppArmadillo`, and `nanoarrow`.
Moreover, the package can be installed (as a binary, where available, or from source) from its r-universe repository via

    urls <- c("https://eddelbuettel.r-universe.dev", "https://cloud.r-project.org")
    install.packages('naarma', repos = urls)

and is also available as an Ubuntu binary (see the docs for that).
Author: Dirk Eddelbuettel

License: GPL (>= 2)