Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Custom table significance / Count of missing values in total rows #110

Open
JB0207 opened this issue Apr 5, 2024 · 5 comments
Open

Custom table significance / Count of missing values in total rows #110

JB0207 opened this issue Apr 5, 2024 · 5 comments

Comments

@JB0207
Copy link

JB0207 commented Apr 5, 2024

Dear @gdemin,

I have two questions:

  1. I would like to conduct significance tests in a custom table. As far as I understood from the documentation the underlying command of tab_last_sig_cpct is prop.test. But when comparing the results, I can not reproduce the result of tab_last_sig_cpct (e.g. red boxes in picture). Am I conducting the significance tests wrong?

Results:
image

My Code:

table1 <- DF %>% 
  tab_cols( 
    total(),
    X,
  ) %>%
  tab_cells( 
    C1,
    subtotal(Y1 %to% Y2, "Bottom 2 Box" = 1:2, "Top 2 Box" = 4:5, position = "bottom") 
  ) %>%
  tab_stat_cases(label = "Cases", total_row_position = "below") %>%
  tab_last_sig_cpct(as_spss = T) %>%
  tab_pivot(stat_position = "inside_columns") %>%
  drop_empty_rows() %>% 
  drop_empty_columns() 

  1. Is there a way to display the count of NA values of a variable above or below the total rows?

Many thanks in advance.

@gdemin
Copy link
Owner

gdemin commented Apr 6, 2024

Hi!

tab_last_sig_cpct works only on percents. When you apply it on the table with counts, as in your case, it gives incorrect results.

As for count of NA:

library(expss)

data("mtcars")

mtcars$am[1:3] = NA

var_lab(mtcars$am) = 'Transmission'

count_na = function(x, weight = NULL){
    if(NCOL(x)>1){
        # for multiple choice
        vec = rowSums(!is.na(x))==0
        
    } else {
        vec = is.na(x)
    }
    w_sum(vec, weight = weight)
}

mtcars %>% 
    tab_cols(total(), vs) %>% 
    tab_cells(am) %>% 
    tab_stat_cpct() %>% 
    tab_stat_fun(count_na, label = 'NA') %>% 
    tab_pivot()

@JB0207
Copy link
Author

JB0207 commented Apr 8, 2024

Thank you for your reply, clarification on tab_last_sig_cpct and your suggestion for a solution counting NAs. Unfortunately, the function does not work correctly in combination with subtotal. Example code for error replication:

# set digits
expss_digits(digits = 2)

# set output option
expss_output_viewer()

# for replication
set.seed(123)

test_df <- data.frame(
  Y1 = sample(1:3, 100, replace = TRUE),
  Q1 = sample(0:1, 100, replace = TRUE),
  Q3 = sample(c(1:5, NA), 100, replace = TRUE, prob = c(rep(1, 5), 0.1)),
  Q4 = sample(c(1:5, NA), 100, replace = TRUE, prob = c(rep(1, 5), 0.1))
)

var_lab(test_df$Q1) = "Test1"
var_lab(test_df$Q3) = "Test2"
var_lab(test_df$Q4) = "Test3"

test_df %>% 
  tab_cols( 
    total(),
    Y1
  ) %>%
  tab_cells( 
    Q1,
    subtotal(Q3 %to% Q4, "Bottom 2 Box" = 1:2, "Top 2 Box" = 4:5, position = "bottom") 
  ) %>%
  tab_stat_cpct(label="%", total_statistic = "w_cpct", total_label = "#Total cases") %>%
  tab_stat_fun(count_na, label = 'NA') %>% 
  tab_pivot()

But when I specified NA within the subtotal command, it actually worked, but the count of NAs is displayed twice as category 6 (not wanted) and as defined by me with "NA". With hide I could not hide the unwanted category 6 in the output. Is there any way to do this, or is it possible to avoid the automatic setting of a value label?

Issue_Github2

Code:

test_df %>%
  tab_cols(total(),
           Y1) %>%
  tab_cells(Q1,
            subtotal(Q3 %to% Q4, "Bottom 2 Box" = 1:2, "Top 2 Box" = 4:5, "NA" = NA, position = "bottom")
            ) %>%
  tab_stat_cases(label = "Cases", total_row_position = "below") %>%
  tab_stat_cpct(label = "%", total_statistic = "w_cpct", total_label = "#Total cases") %>%
  tab_pivot(stat_position = "inside_columns")

@gdemin
Copy link
Owner

gdemin commented Apr 15, 2024

Which version of the expss dou you use? hide works for me:

test_df %>%
    tab_cols(total(),
             Y1) %>%
    tab_cells(Q1,
              subtotal(Q3 %to% Q4, "Bottom 2 Box" = 1:2, "Top 2 Box" = 4:5, "NA" = hide(NA), position = "bottom")
    ) %>%
    tab_stat_cases(label = "Cases", total_row_position = "below") %>%
    tab_stat_cpct(label = "%", total_statistic = "w_cpct", total_label = "#Total cases") %>%
    tab_pivot(stat_position = "inside_columns")


 # |       |              | #Total |     |    Y1 |        |       |        |       |        |
 # |       |              |  Cases |   % |     1 |        |     2 |        |     3 |        |
 # |       |              |        |     | Cases |      % | Cases |      % | Cases |      % |
 # | ----- | ------------ | ------ | --- | ----- | ------ | ----- | ------ | ----- | ------ |
 # | Test1 |            0 |     44 |  44 |    16 |  48.48 |    11 |  34.38 |    17 |  48.57 |
 # |       |            1 |     56 |  56 |    17 |  51.52 |    21 |  65.62 |    18 |  51.43 |
 # |       | #Total cases |    100 | 100 |    33 | 100.00 |    32 | 100.00 |    35 | 100.00 |
 # | Test2 |            1 |     16 |  16 |     5 |  15.15 |     9 |  28.12 |     2 |   5.71 |
 # |       |            2 |     23 |  23 |     7 |  21.21 |     7 |  21.88 |     9 |  25.71 |
 # |       |            3 |     20 |  20 |     5 |  15.15 |     7 |  21.88 |     8 |  22.86 |
 # |       |            4 |     21 |  21 |     7 |  21.21 |     5 |  15.62 |     9 |  25.71 |
 # |       |            5 |     19 |  19 |     9 |  27.27 |     4 |  12.50 |     6 |  17.14 |
 # |       | Bottom 2 Box |     39 |  39 |    12 |  36.36 |    16 |  50.00 |    11 |  31.43 |
 # |       |    Top 2 Box |     40 |  40 |    16 |  48.48 |     9 |  28.12 |    15 |  42.86 |
 # |       |           NA |      1 |   1 |       |        |       |        |     1 |   2.86 |
 # |       | #Total cases |    100 | 100 |    33 | 100.00 |    32 | 100.00 |    35 | 100.00 |
 # | Test3 |            1 |     25 |  25 |     5 |  15.15 |    11 |  34.38 |     9 |  25.71 |
 # |       |            2 |     19 |  19 |     6 |  18.18 |     6 |  18.75 |     7 |  20.00 |
 # |       |            3 |     15 |  15 |     9 |  27.27 |     5 |  15.62 |     1 |   2.86 |
 # |       |            4 |     19 |  19 |     8 |  24.24 |     5 |  15.62 |     6 |  17.14 |
 # |       |            5 |     17 |  17 |     4 |  12.12 |     4 |  12.50 |     9 |  25.71 |
 # |       | Bottom 2 Box |     44 |  44 |    11 |  33.33 |    17 |  53.12 |    16 |  45.71 |
 # |       |    Top 2 Box |     36 |  36 |    12 |  36.36 |     9 |  28.12 |    15 |  42.86 |
 # |       |           NA |      5 |   5 |     1 |   3.03 |     1 |   3.12 |     3 |   8.57 |
 # |       | #Total cases |    100 | 100 |    33 | 100.00 |    32 | 100.00 |    35 | 100.00 |

@JB0207
Copy link
Author

JB0207 commented Apr 15, 2024

Ah, I was using hide incorrectly. Now it works! Is there a way to remove the NA count from the total number of cases and the percentage column so that for Variable Q3 ("Test2") n=99 is used as the base and for Variable Q4 ("Test3") n=95 is used as the base?

@gdemin
Copy link
Owner

gdemin commented May 7, 2024

Sorry for delay. Below the code - it is not very nice but works. The idea is to reorder rows according to variable labels.

# set digits
library(expss)
expss_digits(digits = 2)

# set output option
expss_output_default()

count_na = function(x, weight = NULL){
    if(NCOL(x)>1){
        # for multiple choice
        vec = rowSums(!is.na(x))==0
        
    } else {
        vec = is.na(x)
    }
    w_sum(vec, weight = weight)
}

tab_reorder = function(data){
    to_split = gsub('^([^|]+).+$', "\\1", data$row_labels)
    keep_order = unique(to_split)
    res = split(data, to_split) 
    do.call(add_rows, res[keep_order])   
}

# for replication
set.seed(123)

test_df <- data.frame(
    Y1 = sample(1:3, 100, replace = TRUE),
    Q1 = sample(0:1, 100, replace = TRUE),
    Q3 = sample(c(1:5, NA), 100, replace = TRUE, prob = c(rep(1, 5), 0.1)),
    Q4 = sample(c(1:5, NA), 100, replace = TRUE, prob = c(rep(1, 5), 0.1))
)

var_lab(test_df$Q1) = "Test1"
var_lab(test_df$Q3) = "Test2"
var_lab(test_df$Q4) = "Test3"

test_df %>% 
    tab_cols( 
        total(),
        Y1
    ) %>%
    tab_cells( 
        Q1,
        subtotal(Q3 %to% Q4, "Bottom 2 Box" = 1:2, "Top 2 Box" = 4:5, position = "bottom") 
    ) %>%
    tab_stat_cpct(label="%", total_statistic = "w_cpct", total_label = "#Total cases") %>%
    tab_cells(Q1, Q3 %to% Q4) %>%  
    tab_stat_fun(count_na, label = 'NA') %>% 
    tab_pivot() %>% 
    tab_reorder()
    

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants