Skip to content
This repository has been archived by the owner on Jun 24, 2020. It is now read-only.

Pareto formula for estimating the median given only subgroup totals #44

Open
anthonyjpesce opened this issue Jul 31, 2014 · 8 comments
Open

Comments

@anthonyjpesce
Copy link

Here's some SAS to get us started

*************************************************;
*** SAS program to calculate median HH income ***;
*** from table P80 of STF3A 1990 (steve doig) ***;
*************************************************;

*** assign lower limits of ranges and range sizes to
    variables ;

  RETAIN  L1 0 L2 5000 L3 10000
  L4  12500 L5  15000 L6  17500 L7  20000
  L8  22500 L9  25000 L10 27500 L11 30000
  L12 32500 L13 35000 L14 37500 L15 40000
  L16 42500 L17 45000 L18 47500 L19 50000
  L20 55000 L21 60000 L22 75000 L23 100000
  L24 125000 L25 150000

  I1 4999 I2 4999 I3 2499 I4 2499 I5 2499 I6 2499
  I7 2499 I8 2499 I9 2499 I10 2499
  I11 2499 I12 2499 I13 2499 I14 2499
  I15 2499 I16 2499 I17 2499 I18 2499
  I19 4999 I20 4999 I21 14999 I22 24999
  I23 24999 I24 24999 I25 1000 ;

*** put the data into arrays ;
  ARRAY LOW_LIMS {25} L1-L25;
  ARRAY INT_SIZE {25} I1-I25;

 COUNT1=0; CUMUL1=0;

*** read in the numbers from the income table ;

  ARRAY INCOME1{25} P80I1-P80I25 ;
  N1=SUM(OF P80I1-P80I25);
  IF N1 GT 0 THEN DO UNTIL (CUMUL1 GT N1/2);
     COUNT1+1;
     CUMUL1+INCOME1{COUNT1};
     END;
  ELSE CUMUL1=.;
  IF N1 GT 0 THEN DO;
    SIGMAFX1=CUMUL1-INCOME1{COUNT1};
    FMD1=INCOME1{COUNT1};
    LMD1=LOW_LIMS{COUNT1};
    I1=INT_SIZE{COUNT1};
    MEDIAN = LMD1+((N1/2-SIGMAFX1)/FMD1)*I1;
    END;
   ELSE MEDIAN = 0;
@anthonyjpesce
Copy link
Author

The top thing with the L1, L2, etc is a list of the breaks/groups of data. The second thing with the I1, I2, etc is a list of the width of those groups.

The rest is still a little bit of a mystery.

@palewire
Copy link
Contributor

Is there any kind of documentation out there on the web we can learn from?

@anthonyjpesce
Copy link
Author

Only in Sandy's brain AFAIK

@palewire
Copy link
Contributor

Here is the contents of the Excel file I've been told has been used as a model by others.

1406765909 04

Here are the formulas that appear in column E between rows 9 and 15.

=IF(SUM(B$2:B8)>SUM(B$2:B$14)/2,(ABS(SUM(B$2:B$14)/2-SUM(B$2:B7))/B8)*D8+C8,"")
=IF(SUM(B$2:B9)>SUM(B$2:B$14)/2,(ABS(SUM(B$2:B$14)/2-SUM(B$2:B8))/B9)*D9+C9,"")
=IF(SUM(B$2:B10)>SUM(B$2:B$14)/2,(ABS(SUM(B$2:B$14)/2-SUM(B$2:B9))/B10)*D10+C10,"")
=IF(SUM(B$2:B11)>SUM(B$2:B$14)/2,(ABS(SUM(B$2:B$14)/2-SUM(B$2:B10))/B11)*D11+C11,"")
=IF(SUM(B$2:B12)>SUM(B$2:B$14)/2,(ABS(SUM(B$2:B$14)/2-SUM(B$2:B11))/B12)*D12+C12,"")
=IF(SUM(B$2:B13)>SUM(B$2:B$14)/2,(ABS(SUM(B$2:B$14)/2-SUM(B$2:B12))/B13)*D13+C13,"")
=IF(SUM(B$2:B14)>SUM(B$2:B$14)/2,(ABS(SUM(B$2:B$14)/2-SUM(B$2:B13))/B14)*D14+C14,"")
=MIN(E2:E14)

@palewire
Copy link
Contributor

Here is a Wolfram Alpha entry on pareto distribution.

@palewire palewire changed the title Let's get a sane, documented version of the pareto formula for taking the median of grouped data pareto formula for estimating the median given only subgroup totals Jul 31, 2014
@palewire palewire changed the title pareto formula for estimating the median given only subgroup totals Pareto formula for estimating the median given only subgroup totals Jul 31, 2014
anthonyjpesce added a commit that referenced this issue Jul 31, 2014
@palewire
Copy link
Contributor

Here's a Wikipedia article that might be relevant. It would be great to see if we can find that Census source it cites.

@palewire
Copy link
Contributor

palewire commented Aug 1, 2014

@anthonyjpesce acquired a source document from the Census, which I have posted online here.

He writes:

They pointed me to pages 16 and 17 of this document. ... What we've been calling Pareto is actually linear interpolation (we'll have to rename that), though it seems they use both depending on the application. I think we're going to stick with linear for our purposes.

Here are those pages.

sipp-2008-panel-wave-05-core-source-and-accuracy-p16-large

sipp-2008-panel-wave-05-core-source-and-accuracy-p17-large

@nikisix
Copy link

nikisix commented Sep 25, 2015

I wonder how wrong doing a weighted average using the midpoints of each category is in general?
I ran the weighted average on the age group example listed above and came up with these results:
Pareto-Median: 35.219
LinearInterp-Median: 35.292
WeightedAverage-Median: 37.727

So, clearly it's further off, but it's much easier for me to calculate (I'm calcing these medians from large sql queries where multiple steps (like finding which category is the mid category) is much harder than in a sequential programing language).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants