Knowing SQL is a must for Data Scientists + other analytics professionals, but how can R users start practicing their SQL skills in a familiar environment? That’s this post!
Why This Post?

At work, we’re gearing up to start using SQL databases and, admittedly, it’s been a while since I’ve used SQL. While I can write dplyr in my sleep (which is very similar), I’ve been wanting to get some practice rounds in to refresh my SQL familiarity with interesting data. In part, it’s for me, but we also have folks on our team who are learning how to code, and some will be using SQL for the first time. Coincidentally, Spatial Data Wizard Kyle Walker posted this on Twitter, which caught my eye:
> Many data scientists find that SQL and databases are crucial skills on the job. But how do you learn these skills coming from the #rstats world?
>
> Ch11 of my book gets you practice setting up, querying, and analyzing a 9m record database from @ipums: https://t.co/0hhD3f30Wo
>
> — Kyle Walker (@kyle_e_walker), August 2, 2023
Using Arrow + DuckDB for SQL Practice

So I got the tutorial going, and right about the time I started importing the IPUMS data into a PostgreSQL database with DBeaver, I realized that loading it would take 2 hours and opted for something that would let me play with the data in SQL much faster.
Enter Arrow + DuckDB. If you haven’t heard of Arrow or DuckDB, they’re fairly new tools to R’s world of wizardry. They work across multiple languages, but I’ll be using their R docs for now because they’re really good.
So what are they? Well, Arrow bills itself as making it easy to work with “in-memory and larger-than-memory data”, while DuckDB is an “in-process SQL OLAP database management system”. In layman’s terms, they’re built for handling ridiculously large data within your local environment that might otherwise require larger, more involved database management systems. Together, they “are capable of executing on data without fully loading it from disk”.
That last part is key for me right now, because I don’t have 2 hours to load the export from Kyle’s tutorial in PostgreSQL before practicing. Granted, the data is pretty massive at 126,567,125 rows and 17 columns. I can’t even load the 10GB CSV extract on my M1 MacBook Pro with read_csv(). If I try, RStudio’s console casually starts out like this…
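```r
indexing usa_00002.csv [========================================================--] 505.37MB/s, eta:  1s
```

Then, it just kind of hangs. After several minutes, RStudio finally taps out and says:

```r
> ipums_csv <- read_csv("usa_00002.csv")
Error: vector memory exhausted (limit reached?)
```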
So how do we load this into a database that will let us run SQL queries on Kyle’s gigantic tutorial data? Fortunately, the Arrow and DuckDB R packages let us easily read the file in as an Arrow Table, explore it a bit, and then register it as a database view that we can run SQL queries against. (For a good tutorial on this approach, see this page on DuckDB’s blog: https://duckdb.org/2021/12/03/duck-arrow.html.)
For context, I’m going to use a Parquet file (which is much lighter than CSV), but instead of read_parquet(), you can use read_csv_arrow() on the CSV export from IPUMS to get the same results. It will just mean slightly slower load times. (There’s a small sketch of that CSV route after the preview below.)
Show the code | Build A SQL Database with Arrow + DuckDB
```r
library(arrow)
library(duckdb)

## Reads Our IPUMS Parquet into An Arrow Table
ipums_db <- arrow::read_parquet("raw_data/ipums_export.parquet", as_data_frame = FALSE)

# Establish A Database Connection with DuckDB
con <- dbConnect(duckdb::duckdb())

# Registers arrow table as a DuckDB view
arrow::to_duckdb(ipums_db, table_name = "ipums_db", con = con)
```
```
# Source:   table<ipums_db> [?? x 17]
# Database: DuckDB 0.8.1 [root@Darwin 22.2.0:R 4.3.1/:memory:]
    YEAR SAMPLE SERIAL CBSERIAL  HHWT CLUSTER STATEFIP STRATA    GQ PERNUM PERWT
   <int>  <int>  <int>    <dbl> <dbl>   <dbl>    <int>  <dbl> <int>  <int> <dbl>
 1  1850 185001    101       NA    97 1.85e12        1 1.10e8     1      1    97
 2  1850 185001    101       NA    97 1.85e12        1 1.10e8     1      2    97
 3  1850 185001    101       NA    97 1.85e12        1 1.10e8     1      3    97
 4  1850 185001    101       NA    97 1.85e12        1 1.10e8     1      4    97
 5  1850 185001    101       NA    97 1.85e12        1 1.10e8     1      5    97
 6  1850 185001    101       NA    97 1.85e12        1 1.10e8     1      6    97
 7  1850 185001    101       NA    97 1.85e12        1 1.10e8     1      7    97
 8  1850 185001    101       NA    97 1.85e12        1 1.10e8     1      8    97
 9  1850 185001    101       NA    97 1.85e12        1 1.10e8     1      9    97
10  1850 185001    201       NA    97 1.85e12        1 1.10e8     1      1    97
# ℹ more rows
# ℹ 6 more variables: SEX <int>, AGE <int>, MARST <int>, LIT <int>,
#   VERSIONHIST <int>, HISTID <chr>
```
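If you only have the CSV extract from IPUMS, the same setup works by swapping read_parquet() for read_csv_arrow(). Here’s a minimal sketch of that route, assuming the extract is saved as raw_data/usa_00002.csv (a placeholder file name; use whatever IPUMS gave you):

```r
library(arrow)
library(duckdb)

# Read the IPUMS CSV extract into an Arrow Table rather than an R data frame
ipums_db <- arrow::read_csv_arrow("raw_data/usa_00002.csv", as_data_frame = FALSE)

# Register it with DuckDB exactly as before
con <- dbConnect(duckdb::duckdb())
arrow::to_duckdb(ipums_db, table_name = "ipums_db", con = con)
```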
Exploratory Analysis with SQL Queries On DuckDB
Generally, I know what’s in this data export. It’s a complete-count census file from 1910. It has features that tell us about the population (like age, sex, state, marital status, literacy, etc.) but now that it’s loaded, I want to do some high-level exploration of what’s in the 126 million rows and 17 columns. As an eyebrow raiser, I’ll note that Kyle’s tutorial says the export should have 92 million rows. So what gives?
Previewing Our Data
For starters, I need to recall all of the features we exported so I know exactly what I’m working with before I figure out why our sample is different from the one Kyle reported. It’s technically available above, but I want a small look at just 10 rows, so I’m gonna pull that sample and preview all table columns:
Show the code | Previewing 10 Observations from our DB
```sql
SELECT * FROM ipums_db LIMIT 10
```
Displaying records 1 - 10
| YEAR | SAMPLE | SERIAL | CBSERIAL | HHWT | CLUSTER | STATEFIP | STRATA | GQ | PERNUM | PERWT | SEX | AGE | MARST | LIT | VERSIONHIST | HISTID |
|------|--------|--------|----------|------|----------|----------|-----------|----|--------|-------|-----|-----|-------|-----|-------------|--------|
| 1850 | 185001 | 101 | NA | 97 | 1.85e+12 | 1 | 110100100 | 1 | 1 | 97 | 1 | 43 | NA | 4 | NA | NA |
| 1850 | 185001 | 101 | NA | 97 | 1.85e+12 | 1 | 110100100 | 1 | 2 | 97 | 2 | 38 | NA | 4 | NA | NA |
| 1850 | 185001 | 101 | NA | 97 | 1.85e+12 | 1 | 110100100 | 1 | 3 | 97 | 2 | 17 | NA | 0 | NA | NA |
| 1850 | 185001 | 101 | NA | 97 | 1.85e+12 | 1 | 110100100 | 1 | 4 | 97 | 2 | 14 | NA | 0 | NA | NA |
| 1850 | 185001 | 101 | NA | 97 | 1.85e+12 | 1 | 110100100 | 1 | 5 | 97 | 1 | 11 | NA | 0 | NA | NA |
| 1850 | 185001 | 101 | NA | 97 | 1.85e+12 | 1 | 110100100 | 1 | 6 | 97 | 2 | 8 | NA | 0 | NA | NA |
| 1850 | 185001 | 101 | NA | 97 | 1.85e+12 | 1 | 110100100 | 1 | 7 | 97 | 2 | 7 | NA | 0 | NA | NA |
| 1850 | 185001 | 101 | NA | 97 | 1.85e+12 | 1 | 110100100 | 1 | 8 | 97 | 2 | 5 | NA | 0 | NA | NA |
| 1850 | 185001 | 101 | NA | 97 | 1.85e+12 | 1 | 110100100 | 1 | 9 | 97 | 2 | 1 | NA | 0 | NA | NA |
| 1850 | 185001 | 201 | NA | 97 | 1.85e+12 | 1 | 110100100 | 1 | 1 | 97 | 1 | 38 | NA | 4 | NA | NA |
Filtering On Columns
Single Column
Okay, so a quick look at our sample reveals we accidentally grabbed more than the 1910 data from Kyle’s tutorial, which may explain the discrepancy in sample sizes. Let’s filter it down to match Kyle’s reported figure of 92 million rows and count it to verify since ours is 126 million.
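Before that, as a quick side check that isn’t in Kyle’s tutorial, we can also ask DuckDB which census years actually landed in our export — a small sketch:

```sql
SELECT YEAR, count(*) AS n
FROM ipums_db
GROUP BY YEAR
ORDER BY YEAR
```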
Show the code | Counting All Observations From 1910
```sql
SELECT count(*) FROM ipums_db WHERE YEAR = 1910
```
1 records

| count_star() |
|-------------:|
| 92966771 |
It’s a match! So now, let’s re-look at a small sample with the query below, using only 1910 data. When we run it, it’s easy to see that some features now have data (like marital status) whereas they didn’t before, likely because the 1850 complete-count census didn’t include that question. Another big difference between the two is the HHWT column, which tells us how many households are represented in this count. For the 1850 census, the complete-count data only accounted for 97% of US households; for the 1910 census, we see that 100% of US households are accounted for. (I’ll double-check those weights with a quick query after the preview.)
Show the code | Filtering on A Single Column
```sql
SELECT * FROM ipums_db WHERE YEAR = 1910 LIMIT 10
```
Displaying records 1 - 10
| YEAR | SAMPLE | SERIAL | CBSERIAL | HHWT | CLUSTER | STATEFIP | STRATA | GQ | PERNUM | PERWT | SEX | AGE | MARST | LIT | VERSIONHIST | HISTID |
|------|--------|--------|----------|------|----------|----------|-----------|----|--------|-------|-----|-----|-------|-----|-------------|--------|
| 1910 | 191002 | 101 | NA | 100 | 1.91e+12 | 1 | 110100100 | 1 | 1 | 100 | 1 | 23 | 1 | 1 | NA | NA |
| 1910 | 191002 | 101 | NA | 100 | 1.91e+12 | 1 | 110100100 | 1 | 2 | 100 | 2 | 25 | 1 | 4 | NA | NA |
| 1910 | 191002 | 102 | NA | 100 | 1.91e+12 | 1 | 110100100 | 1 | 1 | 100 | 2 | 30 | 2 | 4 | NA | NA |
| 1910 | 191002 | 201 | NA | 100 | 1.91e+12 | 1 | 110100100 | 1 | 1 | 100 | 1 | 35 | 1 | 4 | NA | NA |
| 1910 | 191002 | 201 | NA | 100 | 1.91e+12 | 1 | 110100100 | 1 | 2 | 100 | 2 | 35 | 1 | 4 | NA | NA |
| 1910 | 191002 | 201 | NA | 100 | 1.91e+12 | 1 | 110100100 | 1 | 3 | 100 | 1 | 8 | 6 | 0 | NA | NA |
| 1910 | 191002 | 201 | NA | 100 | 1.91e+12 | 1 | 110100100 | 1 | 4 | 100 | 2 | 5 | 6 | 0 | NA | NA |
| 1910 | 191002 | 201 | NA | 100 | 1.91e+12 | 1 | 110100100 | 1 | 5 | 100 | 2 | 3 | 6 | 0 | NA | NA |
| 1910 | 191002 | 201 | NA | 100 | 1.91e+12 | 1 | 110100100 | 1 | 6 | 100 | 1 | 1 | 6 | 0 | NA | NA |
| 1910 | 191002 | 301 | NA | 100 | 1.91e+12 | 1 | 110100100 | 1 | 1 | 100 | 1 | 69 | 1 | 4 | NA | NA |
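As promised, here’s that quick check on the household weights by census year — a small sketch that isn’t part of Kyle’s tutorial:

```sql
SELECT YEAR, HHWT, count(*) AS n
FROM ipums_db
GROUP BY YEAR, HHWT
ORDER BY YEAR, HHWT
```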
Multiple Columns
Deviating from Kyle’s tutorial a bit, I’m going to do an aside and explore age distributions in Texas for this example. To do that, we can start by filtering for Texans within the 1910 census sample using the STATEFIP feature, which uses “FIPS” codes to classify states. Texas is number 48.
Now, I did count this sample outside of the tutorial (just to keep it brief) and the count for Texas in 1910 was around 3.9 million. Just over 100 years later, we’re now over 30 million. That’s pretty wild.
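Show the code | Filtering on Multiple Columns

```sql
SELECT YEAR, SAMPLE, STATEFIP, SEX, AGE, LIT, MARST
FROM ipums_db
WHERE YEAR = 1910 AND STATEFIP = 48
LIMIT 10
```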
Grouping + Summary Operations

Okay, now that we’ve previewed our Texas data, let’s build off the query above and use GROUP BY on the AGE and SEX variables. Interestingly, our dataset says there were a couple of folks who lived to see 120 and one woman who lived to 130 in 1910. Now, I think these folks were fudging their birthdates, but whatever. I’m gonna set that aside for now and work with it.
Show the code | Counting Texans by Age + Gender in 1910
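```sql
SELECT SEX, AGE, COUNT(*) AS "count"
FROM ipums_db
WHERE YEAR = 1910 AND STATEFIP = 48
GROUP BY AGE, SEX
ORDER BY AGE
```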
Logical Operations

So, maybe you’re wondering how I knew our 130-year-old Texan was a woman. It’s because IPUMS has a codebook, which documents the classification system used for certain features. In this case, it’s 1 for Male and 2 for Female. That’s all well and good, but it isn’t very readable, and it’d be a lot better if we just reclassified these codes. We can do that by building on our query with CASE and WHEN:
Show the code | Reclassifying from Coded Gender Vars
```sql
SELECT
  AGE AS "age",
  CASE
    WHEN SEX = 1 THEN 'Male'
    WHEN SEX = 2 THEN 'Female'
    ELSE 'Not Specified'
  END AS gender,
  COUNT(*) AS "count"
FROM ipums_db
WHERE YEAR = 1910 AND STATEFIP = 48
GROUP BY age, gender
ORDER BY age
```
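As a quick aside for fellow #rstats folks (this isn’t part of Kyle’s tutorial), the same registered DuckDB table can also be queried lazily with dplyr, and dbplyr translates the pipeline into roughly the CASE WHEN query above — a small sketch:

```r
library(dplyr)

# dbplyr translates this pipeline into SQL and runs it inside DuckDB;
# collect() pulls only the summarized result back into R
tbl(con, "ipums_db") |>
  filter(YEAR == 1910, STATEFIP == 48) |>
  mutate(gender = case_when(SEX == 1 ~ "Male",
                            SEX == 2 ~ "Female",
                            TRUE ~ "Not Specified")) |>
  group_by(age = AGE, gender) |>
  summarise(count = n()) |>
  arrange(age) |>
  collect()
```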
Visualizing Our Query
From here, I may want to take this data and visualize it. Unfortunately, SQL really isn’t the best place to do that, but I can pass this data to R or Python with Quarto and get to work on some nice charts. In my case, I’ve tried to line up these two distributions and compare them by decade and gender for a simple visual comparison. While imagining the visualization in my mind, I decided to do some extra transformation on the 253 rows I pulled. It was simple enough to quickly do that within R and pass it to our chart for the comparison.
At a quick glance, we can see some similarities in population size before age 30. Once 30 hits, the gaps start to widen. We could explore that more, but that’s not this post! So I’ll stop elaborating on this visualization for now.
Show the code | R Code to Visualize Our Query Data
```r
library(tidyverse)

chart_query_df <- as_query |>
  mutate(
    decade = cut(age,
                 breaks = c(seq(0, max(age) + 9, by = 10), Inf),
                 right = FALSE, labels = FALSE, include.lowest = TRUE),
    decade = decade - 1,
    decade_lbl = if_else(decade == 0,
                         paste0(decade, "-", decade + 1, "0"),
                         paste0(decade, "0-", decade + 1, "0"))
  ) |>
  group_by(decade, decade_lbl, gender) |>
  summarize(count = sum(count))

chart_query_df |>
  ggplot() +
  aes(x = reorder(decade_lbl, decade), y = count, fill = gender) +
  geom_col(alpha = 0.8, position = position_dodge()) +
  geom_col(data = chart_query_df |> filter(decade >= 30), position = position_dodge()) +
  scale_y_continuous(labels = scales::label_comma(scale = .001, suffix = "K"), n.breaks = 6) +
  scale_fill_manual(values = c(Male = "#005f86", Female = "#bf5700")) +
  mrworthingtonR::theme_mrw(base_family = "Graphik", subtitle_size = 18) +
  theme(
    plot.title = ggtext::element_markdown(margin = ggplot2::margin(b = 5), hjust = 0),
    panel.grid.major.y = element_line(color = alpha("#8d8d8d", 0.2), linetype = 1)
  ) +
  labs(
    title = "In 1910, Texas <span style='color:#005f86'>Men</span> Seemed to Live Longer than <span style='color:#bf5700'>Women</span> After Age 30",
    subtitle = "Distributions Among Texan Ages, by Gender | 1910 Complete US Census Count",
    y = "Population Count",
    x = "Age Group",
    caption = "Source: IPUMS 1910 Complete-Count US Census File (CSV)"
  )
```
Concluding Thoughts
And that’s it for me. There’s obviously more I could do with the chart itself, and there are far more questions to ask about the underlying data (which Kyle does in his tutorial), but I’m going to pause here since this was a post to show how you can use R, Arrow, + DuckDB to wrangle gigantic volumes of data and derive insights with SQL queries! So, if you were looking to get a quick database environment going so you can practice SQL, you can use this same approach with virtually any other data. Just load it into DuckDB, point a connection at it, and start running SQL query chunks within a Quarto doc — a bare-bones sketch of that pattern is below. If you have any questions, revisions, or suggestions, just shoot me a message on twitter or X or whatever it is now (https://twitter.com/mrworthington).
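The sketch, with my_table and my_data_frame as placeholder names for whatever data you bring:

````
```{r}
# R chunk: register a data frame as a DuckDB table
# (for an Arrow Table, use arrow::to_duckdb() as shown earlier)
library(duckdb)

con <- dbConnect(duckdb::duckdb())
duckdb_register(con, "my_table", my_data_frame)
```

```{sql connection = "con"}
-- SQL chunk: query the registered table directly
SELECT count(*) FROM my_table
```
````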