GitHub

Warning: tidyABS is in early development. Please use for experimentation only.

Overview

The tidyABS package converts ABS excel tables to tidy data frames. It uses information on cells’ formatting (indenting, bolding, italics) and some rules-of-thumb to detect the structure of excel tables. It sometimes needs pointers from the user.

Note: tidyABS works with .xlsx files only.

Installation

The tidyABS package is not available on CRAN. It can be installed from github with the following script:

# install.packages("devtools")
devtools::install_github("ianmoran11/tidyABS")

Usage

library(tidyABS)
library(tidyverse)

Below is a short demonstration that tidies a table from the Australian Industry publication (Cat No. 8155.0).

tidy_aus_industry_df <-
  tidyABS_example("australian-industry.xlsx") %>%
  process_sheet(sheets = "Table_1") %>%
  assemble_table_components()

tidy_aus_industry_df %>% str()
#> Classes 'tbl_df', 'tbl' and 'data.frame':    1668 obs. of  8 variables:
#>  $ row         : int  8 8 8 8 8 8 8 8 9 9 ...
#>  $ col         : int  2 3 4 5 6 7 8 9 2 3 ...
#>  $ comment     : chr  NA NA NA NA ...
#>  $ value       : chr  "485" "5843" "54410" "57577" ...
#>  $ col_group_01: chr  "Employment at end of June" "Wages and salaries" "Sales and service income" "Total income" ...
#>  $ col_group_02: chr  "'000" "$m" "$m" "$m" ...
#>  $ row_group_01: chr  "2006–07" "2006–07" "2006–07" "2006–07" ...
#>  $ row_group_02: chr  "AGRICULTURE, FORESTRY AND FISHING" "AGRICULTURE, FORESTRY AND FISHING" "AGRICULTURE, FORESTRY AND FISHING" "AGRICULTURE, FORESTRY AND FISHING" ...

Examples

The tidyABS package contains several example files. Use the helper tidyABS_example() function with no arguments to list these files:

tidyABS_example()
#> [1] "~$australian-industry.xlsx"                          
#> [2] "australian-industry.xlsx"                            
#> [3] "consumer-price-index.xlsx"                           
#> [4] "environmental-economic-accounts.xlsx"                
#> [5] "PhD_ subfield-citizenship-status-ethnicity-race.xlsx"
#> [6] "PhD_major-field.xlsx"

Example 1: Australian Industry

Above is the first sheet of an excel workbook in the Australian Industry publication. We can retrieve the path of this file using the tidyABS_example function:

ai_path <- tidyABS_example("australian-industry.xlsx")

To process the sheet above, we pass the workbook file path to the process_sheet function and identify the sheet we’d like to tidy.

ai_processed <- process_sheet(path = ai_path, sheets = "Table_1")

ai_processed %>% str(1)
#> List of 3
#>  $ col_groups:Classes 'tbl_df', 'tbl' and 'data.frame':  2 obs. of  10 variables:
#>  $ row_groups:Classes 'tbl_df', 'tbl' and 'data.frame':  2 obs. of  11 variables:
#>  $ tabledata :Classes 'tbl_df', 'tbl' and 'data.frame':  1668 obs. of  4 variables:

This produces a list of three data frames. They store the location and format information of row names (row_groups), column names (col_groups) and table data (tabledata).

We can inspect the row and column names in ai_processed using the inspect_table_components function.

inspect_table_components(ai_processed)
#> $col_groups
#> $col_groups[[1]]
#> [1] "Employment at end of June"                                 
#> [2] "Wages and salaries"                                        
#> [3] "Sales and service income"                                  
#> [4] "Total income"                                              
#> [5] "Total expenses"                                            
#> [6] "Operating profit before tax"                               
#> [7] "Earnings before interest tax depreciation and amortisation"
#> [8] "Industry value added"                                      
#> 
#> $col_groups[[2]]
#> [1] "'000" "$m"  
#> 
#> 
#> $row_groups
#> $row_groups[[1]]
#>  [1] "2006–07" "2007–08" "2008–09" "2009–10" "2010–11" "2011–12" "2012–13"
#>  [8] "2013–14" "2014–15" "2015–16" "2016–17"
#> 
#> $row_groups[[2]]
#>  [1] "AGRICULTURE, FORESTRY AND FISHING"              
#>  [2] "MINING"                                         
#>  [3] "MANUFACTURING"                                  
#>  [4] "ELECTRICITY, GAS, WATER AND WASTE SERVICES"     
#>  [5] "CONSTRUCTION"                                   
#>  [6] "WHOLESALE TRADE"                                
#>  [7] "RETAIL TRADE"                                   
#>  [8] "ACCOMMODATION AND FOOD SERVICES"                
#>  [9] "TRANSPORT, POSTAL AND WAREHOUSING"              
#> [10] "INFORMATION MEDIA AND TELECOMMUNICATIONS"       
#> [11] "RENTAL, HIRING AND REAL ESTATE SERVICES"        
#> [12] "PROFESSIONAL, SCIENTIFIC AND TECHNICAL SERVICES"
#> [13] "ADMINISTRATIVE AND SUPPORT SERVICES"            
#> [14] "PUBLIC ADMINISTRATION AND SAFETY (PRIVATE)"     
#> [15] "EDUCATION AND TRAINING (PRIVATE)"               
#> [16] "HEALTH CARE AND SOCIAL ASSISTANCE (PRIVATE)"    
#> [17] "ARTS AND RECREATION SERVICES"                   
#> [18] "OTHER SERVICES"                                 
#> [19] "TOTAL SELECTED INDUSTRIES"                      
#> 
#> 
#> $tabledata
#> list()

We can use plot_table_components to visualize how these groups are spatially layed out in the spreadsheet and see how tidyABS will relate them to table values. Row names directly to the left of their data points should be labelled “W”, and column names directly above should be labelled “N”. (See the unpivotr package for more information.)

plot_table_components(ai_processed) +
  ylim(-30, 0)

Finally, we can assemble the components into a tidy data frame using assemble_table_components.

assemble_table_components(ai_processed) %>%
  glimpse()
#> Observations: 1,668
#> Variables: 8
#> $ row          <int> 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 1...
#> $ col          <int> 2, 3, 4, 5, 6, 7, 8, 9, 2, 3, 4, 5, 6, 7, 8, 9, 2...
#> $ comment      <chr> NA, NA, NA, NA, NA, "estimate has a relative stan...
#> $ value        <chr> "485", "5843", "54410", "57577", "52046", "5461",...
#> $ col_group_01 <chr> "Employment at end of June", "Wages and salaries"...
#> $ col_group_02 <chr> "'000", "$m", "$m", "$m", "$m", "$m", "$m", "$m",...
#> $ row_group_01 <chr> "2006–07", "2006–07", "2006–07", "2006–07", "2006...
#> $ row_group_02 <chr> "AGRICULTURE, FORESTRY AND FISHING", "AGRICULTURE...

Example 2: Environmental-Economic Accounts

Here’s an example that requires some manual work, the Environmental-Economic Accounts. Let’s retrieve the path of our example workbook and proces Table 6.1:

eea_path <- tidyABS_example("environmental-economic-accounts.xlsx")

eea_processed <- process_sheet(path = eea_path, sheets = "Table 6.1")

On visual inspection, we can see row_group_01 has been given a “W” orientation to the data, not “WNW”.

plot_table_components(eea_processed)

Luckily, we can fix this with the change_direction function.

eea_processed <-
  eea_processed %>%
  change_direction("row_group_01", "WNW")

Plotting the table confirms the direction has been corrected.

plot_table_components(eea_processed)

Finally, we can assemble the components into a tidy dataframe using assemble_table_components.

assemble_table_components(eea_processed) %>%
  glimpse()
#> Observations: 156
#> Variables: 8
#> $ row          <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8...
#> $ col          <int> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 2, 3,...
#> $ comment      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
#> $ value        <chr> "13532", "14352", "14075", "14658", "15085", "155...
#> $ col_group_01 <chr> "2003–04", "2004–05", "2005–06", "2006–07", "2007...
#> $ row_group_01 <chr> "Energy Taxes ", "Energy Taxes ", "Energy Taxes "...
#> $ row_group_02 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
#> $ row_group_03 <chr> "Crude oil and LPG", "Crude oil and LPG", "Crude ...

Example 3: Consumer Price Index (time series)

Time series data require the user to manually identify the inner table cells. This is because some of the column names are numeric — for example, collection month.

I recommend using the readABS package for this. It was created for importing ABS time series data and does not require manual identifcation of table cells.

That said, here’s how you would process this table with tidyABS.

cpi_path <- tidyABS_example("consumer-price-index.xlsx")

We need to identify the inner table cells using the manual_value_references argument.This argument takes a vector of addresses, identifying the inner corners of the table.

cpi_processed <-
  process_sheet(
    path = cpi_path, sheets = "Data1",
    manual_value_references = "B11:AB292"
  )

Here is the resulting data frame.

assemble_table_components(cpi_processed) %>%
  glimpse()
#> Observations: 7,185
#> Variables: 15
#> $ row          <int> 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 1...
#> $ col          <int> 2, 3, 4, 5, 6, 7, 9, 10, 2, 3, 4, 5, 6, 7, 9, 10,...
#> $ comment      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
#> $ value        <chr> "3.7", "3.8", "3.7", "3.8", "3.7", "3.8", "3.9", ...
#> $ col_group_01 <chr> "Index Numbers ;  All groups CPI ;  Sydney ;", "I...
#> $ col_group_02 <chr> "Index Numbers", "Index Numbers", "Index Numbers"...
#> $ col_group_03 <chr> "Original", "Original", "Original", "Original", "...
#> $ col_group_04 <chr> "INDEX", "INDEX", "INDEX", "INDEX", "INDEX", "IND...
#> $ col_group_05 <chr> "Quarter", "Quarter", "Quarter", "Quarter", "Quar...
#> $ col_group_06 <chr> "3", "3", "3", "3", "3", "3", "3", "3", "3", "3",...
#> $ col_group_07 <chr> "1948-09-01", "1948-09-01", "1948-09-01", "1948-0...
#> $ col_group_08 <chr> "2018-12-01", "2018-12-01", "2018-12-01", "2018-1...
#> $ col_group_09 <chr> "282", "282", "282", "282", "282", "282", "282", ...
#> $ col_group_10 <chr> "A2325806K", "A2325811C", "A2325816R", "A2325821J...
#> $ row_group_01 <chr> "1948-09-01", "1948-09-01", "1948-09-01", "1948-0...

Name		Name	Last commit message	Last commit date
Latest commit History 49 Commits
R		R
data/ABS		data/ABS
figure		figure
inst/extdata		inst/extdata
man		man
revdep		revdep
tests		tests
vignettes		vignettes
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
00_package-setup.R		00_package-setup.R
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
LICENSE.md		LICENSE.md
NAMESPACE		NAMESPACE
NEWS.md		NEWS.md
README-unnamed-chunk-10-1.png		README-unnamed-chunk-10-1.png
README-unnamed-chunk-11-1.png		README-unnamed-chunk-11-1.png
README-unnamed-chunk-12-1.png		README-unnamed-chunk-12-1.png
README-unnamed-chunk-13-1.png		README-unnamed-chunk-13-1.png
README-unnamed-chunk-14-1.png		README-unnamed-chunk-14-1.png
README-unnamed-chunk-15-1.png		README-unnamed-chunk-15-1.png
README-unnamed-chunk-16-1.png		README-unnamed-chunk-16-1.png
README-unnamed-chunk-17-1.png		README-unnamed-chunk-17-1.png
README-unnamed-chunk-19-1.png		README-unnamed-chunk-19-1.png
README-unnamed-chunk-6-1.png		README-unnamed-chunk-6-1.png
README-unnamed-chunk-7-1.png		README-unnamed-chunk-7-1.png
README-unnamed-chunk-9-1.png		README-unnamed-chunk-9-1.png
README.md		README.md
README.rmd		README.rmd
australian-industry_table-1.png		australian-industry_table-1.png
consumer-price-index_data01.png		consumer-price-index_data01.png
environmental-economic-accounts_table-6-01.png		environmental-economic-accounts_table-6-01.png
tidyABS.Rproj		tidyABS.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Licenses found

Repository files navigation

Overview

Installation

Usage

Examples

Example 1: Australian Industry

Example 2: Environmental-Economic Accounts

Example 3: Consumer Price Index (time series)

About

Licenses found

Releases

Packages

Languages

License

Licenses found

ianmoran11/tidyABS

Folders and files

Latest commit

History

Repository files navigation

Overview

Installation

Usage

Examples

Example 1: Australian Industry

Example 2: Environmental-Economic Accounts

Example 3: Consumer Price Index (time series)

About

Resources

License

Licenses found

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages