Warning: tidyABS is in early development. Please use for experimentation only.
The tidyABS
package converts ABS excel tables to tidy data frames. It
uses information on cells’ formatting (indenting, bolding, italics) and
some rules-of-thumb to detect the structure of excel tables. It
sometimes needs pointers from the user.
Note: tidyABS works with .xlsx files only.
The tidyABS package is not available on CRAN. It can be installed from github with the following script:
# install.packages("devtools")
devtools::install_github("ianmoran11/tidyABS")
library(tidyABS)
library(tidyverse)
Below is a short demonstration that tidies a table from the Australian Industry publication (Cat No. 8155.0).
tidy_aus_industry_df <-
tidyABS_example("australian-industry.xlsx") %>%
process_sheet(sheets = "Table_1") %>%
assemble_table_components()
tidy_aus_industry_df %>% str()
#> Classes 'tbl_df', 'tbl' and 'data.frame': 1668 obs. of 8 variables:
#> $ row : int 8 8 8 8 8 8 8 8 9 9 ...
#> $ col : int 2 3 4 5 6 7 8 9 2 3 ...
#> $ comment : chr NA NA NA NA ...
#> $ value : chr "485" "5843" "54410" "57577" ...
#> $ col_group_01: chr "Employment at end of June" "Wages and salaries" "Sales and service income" "Total income" ...
#> $ col_group_02: chr "'000" "$m" "$m" "$m" ...
#> $ row_group_01: chr "2006–07" "2006–07" "2006–07" "2006–07" ...
#> $ row_group_02: chr "AGRICULTURE, FORESTRY AND FISHING" "AGRICULTURE, FORESTRY AND FISHING" "AGRICULTURE, FORESTRY AND FISHING" "AGRICULTURE, FORESTRY AND FISHING" ...
The tidyABS package contains several example files. Use the helper
tidyABS_example()
function with no arguments to list these files:
tidyABS_example()
#> [1] "~$australian-industry.xlsx"
#> [2] "australian-industry.xlsx"
#> [3] "consumer-price-index.xlsx"
#> [4] "environmental-economic-accounts.xlsx"
#> [5] "PhD_ subfield-citizenship-status-ethnicity-race.xlsx"
#> [6] "PhD_major-field.xlsx"
Above is the first sheet of an excel workbook in the Australian Industry
publication. We can retrieve the path of this file using the
tidyABS_example
function:
ai_path <- tidyABS_example("australian-industry.xlsx")
To process the sheet above, we pass the workbook file path to the
process_sheet
function and identify the sheet we’d like to tidy.
ai_processed <- process_sheet(path = ai_path, sheets = "Table_1")
ai_processed %>% str(1)
#> List of 3
#> $ col_groups:Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 10 variables:
#> $ row_groups:Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 11 variables:
#> $ tabledata :Classes 'tbl_df', 'tbl' and 'data.frame': 1668 obs. of 4 variables:
This produces a list of three data frames. They store the location and
format information of row names (row_groups
), column names
(col_groups
) and table data (tabledata
).
We can inspect the row and column names in ai_processed
using the
inspect_table_components
function.
inspect_table_components(ai_processed)
#> $col_groups
#> $col_groups[[1]]
#> [1] "Employment at end of June"
#> [2] "Wages and salaries"
#> [3] "Sales and service income"
#> [4] "Total income"
#> [5] "Total expenses"
#> [6] "Operating profit before tax"
#> [7] "Earnings before interest tax depreciation and amortisation"
#> [8] "Industry value added"
#>
#> $col_groups[[2]]
#> [1] "'000" "$m"
#>
#>
#> $row_groups
#> $row_groups[[1]]
#> [1] "2006–07" "2007–08" "2008–09" "2009–10" "2010–11" "2011–12" "2012–13"
#> [8] "2013–14" "2014–15" "2015–16" "2016–17"
#>
#> $row_groups[[2]]
#> [1] "AGRICULTURE, FORESTRY AND FISHING"
#> [2] "MINING"
#> [3] "MANUFACTURING"
#> [4] "ELECTRICITY, GAS, WATER AND WASTE SERVICES"
#> [5] "CONSTRUCTION"
#> [6] "WHOLESALE TRADE"
#> [7] "RETAIL TRADE"
#> [8] "ACCOMMODATION AND FOOD SERVICES"
#> [9] "TRANSPORT, POSTAL AND WAREHOUSING"
#> [10] "INFORMATION MEDIA AND TELECOMMUNICATIONS"
#> [11] "RENTAL, HIRING AND REAL ESTATE SERVICES"
#> [12] "PROFESSIONAL, SCIENTIFIC AND TECHNICAL SERVICES"
#> [13] "ADMINISTRATIVE AND SUPPORT SERVICES"
#> [14] "PUBLIC ADMINISTRATION AND SAFETY (PRIVATE)"
#> [15] "EDUCATION AND TRAINING (PRIVATE)"
#> [16] "HEALTH CARE AND SOCIAL ASSISTANCE (PRIVATE)"
#> [17] "ARTS AND RECREATION SERVICES"
#> [18] "OTHER SERVICES"
#> [19] "TOTAL SELECTED INDUSTRIES"
#>
#>
#> $tabledata
#> list()
We can use plot_table_components
to visualize how these groups are
spatially layed out in the spreadsheet and see how tidyABS
will relate
them to table values. Row names directly to the left of their data
points should be labelled “W”, and column names directly above should be
labelled “N”. (See the unpivotr
package for more information.)
plot_table_components(ai_processed) +
ylim(-30, 0)
Finally, we can assemble the components into a tidy data frame using
assemble_table_components
.
assemble_table_components(ai_processed) %>%
glimpse()
#> Observations: 1,668
#> Variables: 8
#> $ row <int> 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 1...
#> $ col <int> 2, 3, 4, 5, 6, 7, 8, 9, 2, 3, 4, 5, 6, 7, 8, 9, 2...
#> $ comment <chr> NA, NA, NA, NA, NA, "estimate has a relative stan...
#> $ value <chr> "485", "5843", "54410", "57577", "52046", "5461",...
#> $ col_group_01 <chr> "Employment at end of June", "Wages and salaries"...
#> $ col_group_02 <chr> "'000", "$m", "$m", "$m", "$m", "$m", "$m", "$m",...
#> $ row_group_01 <chr> "2006–07", "2006–07", "2006–07", "2006–07", "2006...
#> $ row_group_02 <chr> "AGRICULTURE, FORESTRY AND FISHING", "AGRICULTURE...
Here’s an example that requires some manual work, the
Environmental-Economic Accounts. Let’s retrieve the path of our example
workbook and proces Table 6.1
:
eea_path <- tidyABS_example("environmental-economic-accounts.xlsx")
eea_processed <- process_sheet(path = eea_path, sheets = "Table 6.1")
On visual inspection, we can see row_group_01
has been given a “W”
orientation to the data, not “WNW”.
plot_table_components(eea_processed)
Luckily, we can fix this with the change_direction
function.
eea_processed <-
eea_processed %>%
change_direction("row_group_01", "WNW")
Plotting the table confirms the direction has been corrected.
plot_table_components(eea_processed)
Finally, we can assemble the components into a tidy dataframe using
assemble_table_components
.
assemble_table_components(eea_processed) %>%
glimpse()
#> Observations: 156
#> Variables: 8
#> $ row <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8...
#> $ col <int> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 2, 3,...
#> $ comment <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
#> $ value <chr> "13532", "14352", "14075", "14658", "15085", "155...
#> $ col_group_01 <chr> "2003–04", "2004–05", "2005–06", "2006–07", "2007...
#> $ row_group_01 <chr> "Energy Taxes ", "Energy Taxes ", "Energy Taxes "...
#> $ row_group_02 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
#> $ row_group_03 <chr> "Crude oil and LPG", "Crude oil and LPG", "Crude ...
Time series data require the user to manually identify the inner table cells. This is because some of the column names are numeric — for example, collection month.
I recommend using the
readABS
package for this. It
was created for importing ABS time series data and does not require
manual identifcation of table cells.
That said, here’s how you would process this table with tidyABS
.
cpi_path <- tidyABS_example("consumer-price-index.xlsx")
We need to identify the inner table cells using the
manual_value_references
argument.This argument takes a vector of
addresses, identifying the inner corners of the table.
cpi_processed <-
process_sheet(
path = cpi_path, sheets = "Data1",
manual_value_references = "B11:AB292"
)
Here is the resulting data frame.
assemble_table_components(cpi_processed) %>%
glimpse()
#> Observations: 7,185
#> Variables: 15
#> $ row <int> 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 1...
#> $ col <int> 2, 3, 4, 5, 6, 7, 9, 10, 2, 3, 4, 5, 6, 7, 9, 10,...
#> $ comment <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
#> $ value <chr> "3.7", "3.8", "3.7", "3.8", "3.7", "3.8", "3.9", ...
#> $ col_group_01 <chr> "Index Numbers ; All groups CPI ; Sydney ;", "I...
#> $ col_group_02 <chr> "Index Numbers", "Index Numbers", "Index Numbers"...
#> $ col_group_03 <chr> "Original", "Original", "Original", "Original", "...
#> $ col_group_04 <chr> "INDEX", "INDEX", "INDEX", "INDEX", "INDEX", "IND...
#> $ col_group_05 <chr> "Quarter", "Quarter", "Quarter", "Quarter", "Quar...
#> $ col_group_06 <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3",...
#> $ col_group_07 <chr> "1948-09-01", "1948-09-01", "1948-09-01", "1948-0...
#> $ col_group_08 <chr> "2018-12-01", "2018-12-01", "2018-12-01", "2018-1...
#> $ col_group_09 <chr> "282", "282", "282", "282", "282", "282", "282", ...
#> $ col_group_10 <chr> "A2325806K", "A2325811C", "A2325816R", "A2325821J...
#> $ row_group_01 <chr> "1948-09-01", "1948-09-01", "1948-09-01", "1948-0...