Knowing SQL is a must for Data Scientists + other analytics professionals, but how can R users start practicing their SQL skills in a familiar environment? That’s this post!
Why This Post?

At work, we’re gearing up to start using SQL databases and, admittedly, it’s been a while since I’ve used SQL. While I can write dplyr in my sleep (which is very similar), I’ve been wanting to get some practice rounds in to refresh my SQL familiarity with interesting data. In part, it’s for me, but we also have folks on our team who are learning how to code, and some will be using SQL for the first time. Coincidentally, Spatial Data Wizard Kyle Walker posted this on Twitter, which caught my eye:
> Many data scientists find that SQL and databases are crucial skills on the job. But how do you learn these skills coming from the #rstats world?
>
> Ch11 of my book gets you practice setting up, querying, and analyzing a 9m record database from @ipums: https://t.co/0hhD3f30Wo
>
> — Kyle Walker (@kyle_e_walker), August 2, 2023
Using Arrow + DuckDB for SQL Practice

So I got the tutorial going, and right about the time I started importing the IPUMS data into a PostgreSQL database with DBeaver, I realized that loading it would take 2 hours and opted for something that would let me play with the data in SQL much faster.
Enter Arrow + DuckDB. If you haven’t heard of Arrow or DuckDB, they’re fairly new tools to R’s world of wizardry. They work across multiple languages, but I’ll be using their R docs for now because they’re really good.
So what are they? Well, Arrow bills itself as making it easy to work with “in-memory and larger-than-memory data”, while DuckDB is an “in-process SQL OLAP database management system”. In layman’s terms, they’re built for handling ridiculously large data within your local environment that might otherwise require larger, more involved database management systems. Together, they “are capable of executing on data without fully loading it from disk”.
That last part is key for me right now, because I don’t have 2 hours to load the export from Kyle’s tutorial in PostgreSQL before practicing. Granted, the data is pretty massive at 126,567,125 rows and 17 columns. I can’t even load the 10GB CSV extract on my M1 MacBook Pro with read_csv(). If I try, RStudio’s console casually starts out like this…
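```r
indexing usa_00002.csv [========================================================--] 505.37MB/s, eta:  1s
```

Then, it just kind of hangs. After several minutes, RStudio finally taps out and says:

```r
> ipums_csv <- read_csv("usa_00002.csv")
Error: vector memory exhausted (limit reached?)
```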
So how do we load this into a database that will let us run SQL queries on Kyle’s gigantic tutorial data? Fortunately, the Arrow and DuckDB R packages let us easily read the file in as an Arrow Table, explore it a bit, and then register it as a database view that we can run SQL queries against. (For a good tutorial on this approach, see this page on DuckDB’s blog: https://duckdb.org/2021/12/03/duck-arrow.html.)
For context, I’m going to use a Parquet file (which is much lighter than CSV), but instead of read_parquet(), you can use read_csv_arrow() on the CSV export from IPUMS to get the same results. It will just mean slightly slower load times. (There’s a small sketch of that CSV route after the preview below.)
Show the code | Build A SQL Database with Arrow + DuckDB
```r
library(arrow)
library(duckdb)

## Reads Our IPUMS Parquet into An Arrow Table
ipums_db <- arrow::read_parquet("raw_data/ipums_export.parquet", as_data_frame = FALSE)

# Establish A Database Connection with DuckDB
con <- dbConnect(duckdb::duckdb())

# Registers arrow table as a DuckDB view
arrow::to_duckdb(ipums_db, table_name = "ipums_db", con = con)
```
```
# Source:   table<ipums_db> [?? x 17]
# Database: DuckDB 0.8.1 [root@Darwin 22.2.0:R 4.3.1/:memory:]
    YEAR SAMPLE SERIAL CBSERIAL  HHWT CLUSTER STATEFIP STRATA    GQ PERNUM PERWT
   <int>  <int>  <int>    <dbl> <dbl>   <dbl>    <int>  <dbl> <int>  <int> <dbl>
 1  1850 185001    101       NA    97 1.85e12        1 1.10e8     1      1    97
 2  1850 185001    101       NA    97 1.85e12        1 1.10e8     1      2    97
 3  1850 185001    101       NA    97 1.85e12        1 1.10e8     1      3    97
 4  1850 185001    101       NA    97 1.85e12        1 1.10e8     1      4    97
 5  1850 185001    101       NA    97 1.85e12        1 1.10e8     1      5    97
 6  1850 185001    101       NA    97 1.85e12        1 1.10e8     1      6    97
 7  1850 185001    101       NA    97 1.85e12        1 1.10e8     1      7    97
 8  1850 185001    101       NA    97 1.85e12        1 1.10e8     1      8    97
 9  1850 185001    101       NA    97 1.85e12        1 1.10e8     1      9    97
10  1850 185001    201       NA    97 1.85e12        1 1.10e8     1      1    97
# ℹ more rows
# ℹ 6 more variables: SEX <int>, AGE <int>, MARST <int>, LIT <int>,
#   VERSIONHIST <int>, HISTID <chr>
```
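If you only have the CSV extract from IPUMS, the same setup works by swapping read_parquet() for read_csv_arrow(). Here’s a minimal sketch of that route, assuming the extract is saved as raw_data/usa_00002.csv (a placeholder file name; use whatever IPUMS gave you):

```r
library(arrow)
library(duckdb)

# Read the IPUMS CSV extract into an Arrow Table rather than an R data frame
ipums_db <- arrow::read_csv_arrow("raw_data/usa_00002.csv", as_data_frame = FALSE)

# Register it with DuckDB exactly as before
con <- dbConnect(duckdb::duckdb())
arrow::to_duckdb(ipums_db, table_name = "ipums_db", con = con)
```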
Exploratory Analysis with SQL Queries On DuckDB
Generally, I know what’s in this data export. It’s a complete-count census file from 1910. It has features that tell us about the population (like age, sex, state, marital status, literacy, etc.) but now that it’s loaded, I want to do some high-level exploration of what’s in the 126 million rows and 17 columns. As an eyebrow raiser, I’ll note that Kyle’s tutorial says the export should have 92 million rows. So what gives?
Previewing Our Data
For starters, I need to recall all of the features we exported so I know exactly what I’m working with before I figure out why our sample is different from the one Kyle reported. It’s technically available above, but I want a small look at just 10 rows, so I’m gonna pull that sample and preview all table columns:
Show the code | Previewing 10 Observations from our DB
```sql
SELECT * FROM ipums_db LIMIT 10
```
Displaying records 1 - 10
| YEAR | SAMPLE | SERIAL | CBSERIAL | HHWT | CLUSTER | STATEFIP | STRATA | GQ | PERNUM | PERWT | SEX | AGE | MARST | LIT | VERSIONHIST | HISTID |
|------|--------|--------|----------|------|----------|----------|-----------|----|--------|-------|-----|-----|-------|-----|-------------|--------|
| 1850 | 185001 | 101 | NA | 97 | 1.85e+12 | 1 | 110100100 | 1 | 1 | 97 | 1 | 43 | NA | 4 | NA | NA |
| 1850 | 185001 | 101 | NA | 97 | 1.85e+12 | 1 | 110100100 | 1 | 2 | 97 | 2 | 38 | NA | 4 | NA | NA |
| 1850 | 185001 | 101 | NA | 97 | 1.85e+12 | 1 | 110100100 | 1 | 3 | 97 | 2 | 17 | NA | 0 | NA | NA |
| 1850 | 185001 | 101 | NA | 97 | 1.85e+12 | 1 | 110100100 | 1 | 4 | 97 | 2 | 14 | NA | 0 | NA | NA |
| 1850 | 185001 | 101 | NA | 97 | 1.85e+12 | 1 | 110100100 | 1 | 5 | 97 | 1 | 11 | NA | 0 | NA | NA |
| 1850 | 185001 | 101 | NA | 97 | 1.85e+12 | 1 | 110100100 | 1 | 6 | 97 | 2 | 8 | NA | 0 | NA | NA |
| 1850 | 185001 | 101 | NA | 97 | 1.85e+12 | 1 | 110100100 | 1 | 7 | 97 | 2 | 7 | NA | 0 | NA | NA |
| 1850 | 185001 | 101 | NA | 97 | 1.85e+12 | 1 | 110100100 | 1 | 8 | 97 | 2 | 5 | NA | 0 | NA | NA |
| 1850 | 185001 | 101 | NA | 97 | 1.85e+12 | 1 | 110100100 | 1 | 9 | 97 | 2 | 1 | NA | 0 | NA | NA |
| 1850 | 185001 | 201 | NA | 97 | 1.85e+12 | 1 | 110100100 | 1 | 1 | 97 | 1 | 38 | NA | 4 | NA | NA |
Filtering On Columns
Single Column
Okay, so a quick look at our sample reveals we accidentally grabbed more than the 1910 data from Kyle’s tutorial, which may explain the discrepancy in sample sizes. Let’s filter it down to match Kyle’s reported figure of 92 million rows and count it to verify since ours is 126 million.
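Before that, as a quick side check that isn’t in Kyle’s tutorial, we can also ask DuckDB which census years actually landed in our export — a small sketch:

```sql
SELECT YEAR, count(*) AS n
FROM ipums_db
GROUP BY YEAR
ORDER BY YEAR
```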
Show the code | Counting All Observations From 1910
```sql
SELECT count(*) FROM ipums_db WHERE YEAR = 1910
```
1 records

| count_star() |
|-------------:|
| 92966771 |
It’s a match! So now, let’s re-look at a small sample with the query below, using only 1910 data. When we run it, it’s easy to see that some features now have data (like marital status) whereas they didn’t before, likely because the 1850 complete-count census didn’t include that question. Another big difference between the two is the HHWT column, which tells us how many households are represented in this count. For the 1850 census, the complete-count data only accounted for 97% of US households; for the 1910 census, we see that 100% of US households are accounted for. (I’ll double-check those weights with a quick query after the preview.)
Show the code | Filtering on A Single Column
```sql
SELECT * FROM ipums_db WHERE YEAR = 1910 LIMIT 10
```
Displaying records 1 - 10
| YEAR | SAMPLE | SERIAL | CBSERIAL | HHWT | CLUSTER | STATEFIP | STRATA | GQ | PERNUM | PERWT | SEX | AGE | MARST | LIT | VERSIONHIST | HISTID |
|------|--------|--------|----------|------|----------|----------|-----------|----|--------|-------|-----|-----|-------|-----|-------------|--------|
| 1910 | 191002 | 101 | NA | 100 | 1.91e+12 | 1 | 110100100 | 1 | 1 | 100 | 1 | 23 | 1 | 1 | NA | NA |
| 1910 | 191002 | 101 | NA | 100 | 1.91e+12 | 1 | 110100100 | 1 | 2 | 100 | 2 | 25 | 1 | 4 | NA | NA |
| 1910 | 191002 | 102 | NA | 100 | 1.91e+12 | 1 | 110100100 | 1 | 1 | 100 | 2 | 30 | 2 | 4 | NA | NA |
| 1910 | 191002 | 201 | NA | 100 | 1.91e+12 | 1 | 110100100 | 1 | 1 | 100 | 1 | 35 | 1 | 4 | NA | NA |
| 1910 | 191002 | 201 | NA | 100 | 1.91e+12 | 1 | 110100100 | 1 | 2 | 100 | 2 | 35 | 1 | 4 | NA | NA |
| 1910 | 191002 | 201 | NA | 100 | 1.91e+12 | 1 | 110100100 | 1 | 3 | 100 | 1 | 8 | 6 | 0 | NA | NA |
| 1910 | 191002 | 201 | NA | 100 | 1.91e+12 | 1 | 110100100 | 1 | 4 | 100 | 2 | 5 | 6 | 0 | NA | NA |
| 1910 | 191002 | 201 | NA | 100 | 1.91e+12 | 1 | 110100100 | 1 | 5 | 100 | 2 | 3 | 6 | 0 | NA | NA |
| 1910 | 191002 | 201 | NA | 100 | 1.91e+12 | 1 | 110100100 | 1 | 6 | 100 | 1 | 1 | 6 | 0 | NA | NA |
| 1910 | 191002 | 301 | NA | 100 | 1.91e+12 | 1 | 110100100 | 1 | 1 | 100 | 1 | 69 | 1 | 4 | NA | NA |
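As promised, here’s that quick check on the household weights by census year — a small sketch that isn’t part of Kyle’s tutorial:

```sql
SELECT YEAR, HHWT, count(*) AS n
FROM ipums_db
GROUP BY YEAR, HHWT
ORDER BY YEAR, HHWT
```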
Multiple Columns
Deviating from Kyle’s tutorial a bit, I’m going to do an aside and explore age distributions in Texas for this example. To do that, we can start by filtering for Texans within the 1910 census sample using the STATEFIP feature, which uses “FIPS” codes to classify states. Texas is number 48.
Now, I did count this sample outside of the tutorial (just to keep it brief) and the count for Texas in 1910 was around 3.9 million. Just over 100 years later, we’re now over 30 million. That’s pretty wild.
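Show the code | Filtering on Multiple Columns

```sql
SELECT YEAR, SAMPLE, STATEFIP, SEX, AGE, LIT, MARST
FROM ipums_db
WHERE YEAR = 1910 AND STATEFIP = 48
LIMIT 10
```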
Grouping + Summary Operations

Okay, now that we’ve previewed our Texas data, let’s build off the query above and use GROUP BY on the AGE and SEX variables. Interestingly, our dataset says there were a couple of folks who lived to see 120 and one woman who lived to 130 in 1910. Now, I think these folks were fudging their birthdates, but whatever. I’m gonna set that aside for now and work with it.
Show the code | Counting Texans by Age + Gender in 1910
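```sql
SELECT SEX, AGE, COUNT(*) AS "count"
FROM ipums_db
WHERE YEAR = 1910 AND STATEFIP = 48
GROUP BY AGE, SEX
ORDER BY AGE
```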
Logical Operations

So, maybe you’re wondering how I knew our 130-year-old Texan was a woman. It’s because IPUMS has a codebook, which documents the classification system used for certain features. In this case, it’s 1 for Male and 2 for Female. That’s all well and good, but it isn’t very readable, and it’d be a lot better if we just reclassified these codes. We can do that by building on our query with CASE and WHEN:
Show the code | Reclassifying from Coded Gender Vars
```sql
SELECT
  AGE AS "age",
  CASE
    WHEN SEX = 1 THEN 'Male'
    WHEN SEX = 2 THEN 'Female'
    ELSE 'Not Specified'
  END AS gender,
  COUNT(*) AS "count"
FROM ipums_db
WHERE YEAR = 1910 AND STATEFIP = 48
GROUP BY age, gender
ORDER BY age
```
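As a quick aside for fellow #rstats folks (this isn’t part of Kyle’s tutorial), the same registered DuckDB table can also be queried lazily with dplyr, and dbplyr translates the pipeline into roughly the CASE WHEN query above — a small sketch:

```r
library(dplyr)

# dbplyr translates this pipeline into SQL and runs it inside DuckDB;
# collect() pulls only the summarized result back into R
tbl(con, "ipums_db") |>
  filter(YEAR == 1910, STATEFIP == 48) |>
  mutate(gender = case_when(SEX == 1 ~ "Male",
                            SEX == 2 ~ "Female",
                            TRUE ~ "Not Specified")) |>
  group_by(age = AGE, gender) |>
  summarise(count = n()) |>
  arrange(age) |>
  collect()
```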
Visualizing Our Query
From here, I may want to take this data and visualize it. Unfortunately, SQL really isn’t the best place to do that, but I can pass this data to R or Python with Quarto and get to work on some nice charts. In my case, I’ve tried to line up these two distributions and compare them by decade and gender for a simple visual comparison. While imagining the visualization in my mind, I decided to do some extra transformation on the 253 rows I pulled. It was simple enough to quickly do that within R and pass it to our chart for the comparison.
At a quick glance, we can see some similarities in population size before age 30. Once 30 hits, the gaps start to widen. We could explore that more, but that’s not this post! So I’ll stop elaborating on this visualization for now.
Show the code | R Code to Visualize Our Query Data
```r
library(tidyverse)

chart_query_df <- as_query |>
  mutate(
    decade = cut(age,
                 breaks = c(seq(0, max(age) + 9, by = 10), Inf),
                 right = FALSE, labels = FALSE, include.lowest = TRUE),
    decade = decade - 1,
    decade_lbl = if_else(decade == 0,
                         paste0(decade, "-", decade + 1, "0"),
                         paste0(decade, "0-", decade + 1, "0"))
  ) |>
  group_by(decade, decade_lbl, gender) |>
  summarize(count = sum(count))

chart_query_df |>
  ggplot() +
  aes(x = reorder(decade_lbl, decade), y = count, fill = gender) +
  geom_col(alpha = 0.8, position = position_dodge()) +
  geom_col(data = chart_query_df |> filter(decade >= 30), position = position_dodge()) +
  scale_y_continuous(labels = scales::label_comma(scale = .001, suffix = "K"), n.breaks = 6) +
  scale_fill_manual(values = c(Male = "#005f86", Female = "#bf5700")) +
  mrworthingtonR::theme_mrw(base_family = "Graphik", subtitle_size = 18) +
  theme(
    plot.title = ggtext::element_markdown(margin = ggplot2::margin(b = 5), hjust = 0),
    panel.grid.major.y = element_line(color = alpha("#8d8d8d", 0.2), linetype = 1)
  ) +
  labs(
    title = "In 1910, Texas <span style='color:#005f86'>Men</span> Seemed to Live Longer than <span style='color:#bf5700'>Women</span> After Age 30",
    subtitle = "Distributions Among Texan Ages, by Gender | 1910 Complete US Census Count",
    y = "Population Count",
    x = "Age Group",
    caption = "Source: IPUMS 1910 Complete-Count US Census File (CSV)"
  )
```
Concluding Thoughts
And that’s it for me. There’s obviously more I could do with the chart itself, and there are far more questions to ask about the underlying data (which Kyle does in his tutorial), but I’m going to pause here since this was a post to show how you can use R, Arrow, + DuckDB to wrangle gigantic volumes of data and derive insights with SQL queries! So, if you were looking to get a quick database environment going so you can practice SQL, you can use this same approach with virtually any other data. Just load it into DuckDB, point a connection at it, and start running SQL query chunks within a Quarto doc — a bare-bones sketch of that pattern is below. If you have any questions, revisions, or suggestions, just shoot me a message on twitter or X or whatever it is now (https://twitter.com/mrworthington).
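The sketch, with my_table and my_data_frame as placeholder names for whatever data you bring:

````
```{r}
# R chunk: register a data frame as a DuckDB table
# (for an Arrow Table, use arrow::to_duckdb() as shown earlier)
library(duckdb)

con <- dbConnect(duckdb::duckdb())
duckdb_register(con, "my_table", my_data_frame)
```

```{sql connection = "con"}
-- SQL chunk: query the registered table directly
SELECT count(*) FROM my_table
```
````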