--- title: "Translating complex Presto data types" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Translating complex Presto data types} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = RPresto::presto_has_default() ) ``` In this vignette, we demonstrate how complex structural types in Presto can be translated into R types (e.g., vectors, list, and tibbles). ## Package setup ```{r setup} library(RPresto) ``` You can check your `RPresto` version by running the `packageVersion()` function. **You need version 1.3.9 or later to have a more comprehensive and robust complex types support.** ```{r version} packageVersion("RPresto") ``` ## Overview Complex types refer to [structural types][1] including `ARRAY`, `MAP` and `ROW`. Those data types are basically containers to hold other data types (hence *complex*). We summarize the similarities and differences between the complex types below. | Type | Is atomic? | Has names/keys? | |------|:----------:|:---------------:| | ARRAY | Yes | No | | MAP | Yes | Yes | | ROW | No | Yes | Atomic here means all elements in the container share the same type (usually primitive types, but can be complex types too). For example, an `ARRAY` of integer can only hold integer type elements whereas a `ROW` can have elements of different types (e.g., one element is integer and the other boolean). ## Walkthrough preparation ### Local Presto server in memory We assume that the user already have a Presto server with a memory connector set up. If you don't have such a server set up, refer to the [Presto documentation][2] for instructions if you want to follow along. ### Presto connection in R We first create a `PrestoConnection` which will serve as the bridge between the Presto server and R. ```{r PrestoConnection} con <- DBI::dbConnect( drv = RPresto::Presto(), host = "http://localhost", port = 8080, user = Sys.getenv("USER"), catalog = "memory", schema = "default" ) ``` We first issue a simple query to see if the Presto connection is working properly. ```{r test} DBI::dbGetQuery(con, "SELECT 1+1 AS res") ``` ## Atomic structural types that contain primitive types | Presto type | R type | |-------------|:------:| | ARRAY | unnamed typed vector | | MAP | named typed vector | ### `ARRAY`s Given the atomic and unnamed nature of `ARRAY`s, we map them to **unnamed typed vectors** in R. #### Create a table with `ARRAY`s of primitive types We first create a table with `ARRAY`s of all supported primitive Presto data types using the `create_primitive_arrays_table()` function included in the `RPresto` package. ```{r primitive-arrays-table} RPresto:::create_primitive_arrays_table( con, table_name = "presto_primitive_arrays", verbose = FALSE ) ``` We can check if the table now exists in Presto. ```{r check-primitive_arrays-table} DBI::dbExistsTable(con, "presto_primitive_arrays") ``` We can list the fields in the table. They are named after the Presto types they represent. ```{r primitive-arrays-table-fields} DBI::dbListFields(con, "presto_primitive_arrays") ``` #### Translate `ARRAY`s to R vectors ```{r array-of-primitive-types} ( df.array_of_primitive_types <- dbGetQuery( con, "SELECT * FROM presto_primitive_arrays", bigint = "integer64" ) ) ``` We can verify the R types of each column. ```{r array-of-primitive-type-check} tibble::enframe(purrr::map_chr(df.array_of_primitive_types, ~class(.[[1]])[1])) ``` All vectors are unnamed. ```{r array-of-primitive-name-check} purrr::every(df.array_of_primitive_types, ~is.null(names(.[[1]]))) ``` We can also call functions such as `length()` on each of the columns to get the `ARRAY` cardinality. It shows that all `ARRAY`s have 3 elements in them. ```{r array-cardinality} tibble::enframe(purrr::map_int(df.array_of_primitive_types, ~length(.[[1]]))) ``` ### `MAP`s A `MAP` in Presto can be thought as a combination of two same-length `ARRAY`s. The first `ARRAY` contains the keys of the `MAP` and the second contains the values of the `MAP`. In fact, that's exactly how `MAP` literals are created in Presto (e.g., `MAP(ARRAY[1, 2], ARRAY['a', 'b'])` creates a 2-element `MAP`). Following the logic, we translate `MAP`s to **named typed vectors** in R. #### Create a table with `MAP`S of primitive types We first create a table with `MAPS`s of all supported primitive Presto data types using the `create_primitive_maps_table()` function included in the `RPresto` package. ```{r primitive-maps-table} RPresto:::create_primitive_maps_table( con, table_name = "presto_primitive_maps", verbose = FALSE ) ``` We can check if the table now exists in Presto. ```{r check-primitive_maps-table} DBI::dbExistsTable(con, "presto_primitive_maps") ``` We can list the fields in the table. They are named after the Presto types they represent. ```{r primitive-maps-table-fields} DBI::dbListFields(con, "presto_primitive_maps") ``` #### Translate `MAP`s to R vectors ```{r map-of-primitive-types} ( df.map_of_primitive_types <- dbGetQuery( con, "SELECT * FROM presto_primitive_maps", bigint = "integer64" ) ) ``` We can verify the R types of each column. ```{r map-of-primitive-type-check} tibble::enframe(purrr::map_chr(df.map_of_primitive_types, ~class(.[[1]])[1])) ``` All vectors are named. ```{r map-of-primitive-name-check} purrr::none(df.map_of_primitive_types, ~is.null(names(.[[1]]))) ``` ### Repeated `ARRAY`s and `MAP`s It's possible to have repeated `ARRAY`s and `MAP`s in Presto in the form of `ARRAY`s of `ARRAY`s and `ARRAY`s of `MAP`s. | Repeated Presto type | R type | |----------------------|--------| | ARRAY of ARRAY | **not supported** | | ARRAY of MAP | unnamed list of named typed vectors | We are not supporting nested `ARRAY`s at the moment although it's technically possible in Presto. For `ARRAY`s of `MAP`s, we translate the `ARRAY` container into an unnamed list and each of the `MAP` element into a named typed vector. We first create an **array-of-maps** table by using the `create_array_of_maps_table()` function. ```{r array-of-maps-table} RPresto:::create_array_of_maps_table( con, table_name = "presto_array_of_maps", verbose = FALSE ) ``` We can check if the table now exists in Presto. ```{r check-array-of-maps-table} DBI::dbExistsTable(con, "presto_array_of_maps") ``` We can list the fields in the table. ```{r array-of-maps-table-fields} DBI::dbListFields(con, "presto_array_of_maps") ``` Let's import all the data into R. ```{r array-of-maps} ( df.array_of_maps <- dbGetQuery( con, "SELECT * FROM presto_array_of_maps", bigint = "integer64" ) ) ``` We need to pry open the wrapping unnamed list to reveal the types of the vectors underneath. ```{r array-of-map-check} tibble::enframe(purrr::map_chr(df.array_of_maps, ~class(.[[1]][[1]])[1])) ``` ## `ROW` type The easiest way to think about the `ROW` type in Presto is to think of it literally as a *row* of a table. Just as a table can have multiple columns of different data types, a `ROW` can have multiple elements of different types. And just like a table having a name for each column, every element of a `ROW` has a name associated with the value. Depending on whether the `ROW` type is repeated (i.e., wrapped in an `ARRAY`), the translation into R is different. * We translate single `ROW` value to a named list in R. * Rather than interpret repeated `ROW`s (i.e., `ARRAY` of `ROW`s) as a list of named lists, we translate the collection of `ROW`s into a tibble. | Presto type | R type | |-------------|--------| | Single ROW | named list | | Repeated ROWs | tibble | ### Single `ROW` translation To demonstrate how `ROW` types are translated into R types, we first create a table using an auxiliary `create_primitive_rows_table()` function included in the package. The resulting table has only 1 column named `row_primitive_types` which is a `ROW` that includes 18 sub-columns representing all supported primitive types. ```{r primitive-rows-table} RPresto:::create_primitive_rows_table( con, table_name = "presto_primitive_rows", verbose = FALSE ) ``` We can check if the table now exists in Presto. ```{r check-primitive_rows-table} DBI::dbExistsTable(con, "presto_primitive_rows") ``` We can list the fields in the table. ```{r primitive-rows-table-fields} DBI::dbListFields(con, "presto_primitive_rows") ``` We can then retrieve all the data from the table. ```{r row-of-primitive} ( df.row_of_primitive <- dbGetQuery( con, "SELECT row_primitive_types FROM presto_primitive_rows", bigint = "integer64" ) ) ``` We can check the R types of each element in the named list. ```{r row-of-primitive-check} tibble::enframe( purrr::map_chr(df.row_of_primitive$row_primitive_types[[1]], ~class(.)[1]) ) ``` ### Repeated `ROW`s translation To construct a repeated `ROW` column, we use the auxiliary `create_array_of_rows_table()` function. ```{r arrays-of-rows-table} RPresto:::create_array_of_rows_table( con, table_name = "presto_array_of_rows", verbose = FALSE ) ``` We can check if the table now exists in Presto and the field name. ```{r check-array-of-rows-table} DBI::dbExistsTable(con, "presto_array_of_rows") ``` ```{r array-of-rows-table-fields} DBI::dbListFields(con, "presto_array_of_rows") ``` We can import the whole data into R. ```{r array-of-rows} ( df.array_of_rows <- dbGetQuery( con, "SELECT array_of_rows FROM presto_array_of_rows", bigint = "integer64" ) ) ``` We can verify the tibble's column types. ```{r array-of-row-check} tibble::enframe( purrr::map_chr(df.array_of_rows$array_of_rows[[1]], ~class(.)[1]) ) ``` ## Walkthrough wrap-up ```{r close-db} DBI::dbDisconnect(con) ``` [1]: https://prestodb.io/docs/current/language/types.html#structural [2]: https://prestodb.io/docs/current/connector/memory.html