| 1 | +Received: from PEXEE011B.vu.local (130.37.164.17) by PEXHB011B (130.37.236.65) |
| 2 | + with Microsoft SMTP Server (TLS) id 14.2.298.4; Sat, 27 Oct 2012 22:00:34 |
| 3 | + +0200 |
| 4 | +Received: from filter3-utr.mf.surf.net (195.169.124.154) by mailin.vu.nl |
| 5 | + (130.37.164.17) with Microsoft SMTP Server (TLS) id 14.2.298.4; Sat, 27 Oct |
| 6 | + 2012 22:00:34 +0200 |
| 7 | +Received: from balrog.mythic-beasts.com (balrog.mythic-beasts.com |
| 8 | + [93.93.130.6]) by filter3-utr.mf.surf.net (8.14.3/8.14.3/Debian-9.4) with |
| 9 | + ESMTP id q9RK0WdE002487 for < [email protected]>; Sat, 27 Oct 2012 22:00:33 |
| 10 | + +0200 |
| 11 | +Received: from [93.93.130.49] (helo=sphinx.mythic-beasts.com) by |
| 12 | + balrog.mythic-beasts.com with esmtp (Exim 4.69) (envelope-from |
| 13 | + < [email protected]>) id 1TSCYa-0003Pa-H7 for [email protected]; Sat, 27 Oct |
| 14 | + 2012 21:00:32 +0100 |
| 15 | +Received: from popx (helo=localhost) by sphinx.mythic-beasts.com with |
| 16 | + local-esmtp (Exim 4.72) (envelope-from < [email protected]>) id |
| 17 | + 1TSCZp-0000qw-SY for [email protected]; Sat, 27 Oct 2012 21:01:52 +0100 |
| 18 | +Date: Sat, 27 Oct 2012 21:01:49 +0100 |
| 19 | +From: Jocelyn Ireson-Paine < [email protected]> |
| 20 | + |
| 21 | +To: Jan Wielemaker < [email protected]> |
| 22 | +Subject: Re: Spreadsheets |
| 23 | + |
| 24 | + |
| 25 | + |
| 26 | +User-Agent: Alpine 2.02 (LRH 1266 2009-07-14) |
| 27 | +Content-Type: multipart/mixed; |
| 28 | + boundary="1566387330-823825051-1351367894=:7371" |
| 29 | + |
| 30 | +X-Spam-Status: No, score=-0.0 |
| 31 | +X-BlackCat-Spam-Score: 0 |
| 32 | +X-Bayes-Prob: 0.0001 (Score 0, tokens from: @@RPTN) |
| 33 | +X-Spam-Score: -1.10 () [Tag at 6.00] SPF(none:0),CC(GB:-0.1),RBL(RPRL-ham:-1.0) |
| 34 | +X-CanIt-Geo: ip=93.93.130.6; country=GB; latitude=54.0000; longitude=-2.0000; http://maps.google.com/maps?q=54.0000,-2.0000&z=6 |
| 35 | +X-CanItPRO-Stream: vu:Medium (inherits from vu:default,base:default) |
| 36 | +X-Canit-Stats-ID: 08Igk0xpb - fb8fc17e364e - 20121027 (trained as not-spam) |
| 37 | +X-Scanned-By: CanIt (www . roaringpenguin . com) on 195.169.124.154 |
| 38 | + |
| 39 | +X-MS-Exchange-Organization-AVStamp-Mailbox: MSFTFF;1;0;0 0 0 |
| 40 | +X-MS-Exchange-Organization-AuthSource: PEXEE011B.vu.local |
| 41 | +X-MS-Exchange-Organization-AuthAs: Anonymous |
| 42 | +MIME-Version: 1.0 |
| 43 | + |
| 44 | +--1566387330-823825051-1351367894=:7371 |
| 45 | +Content-Type: text/plain; charset="ISO-8859-15"; format=flowed |
| 46 | +Content-Transfer-Encoding: QUOTED-PRINTABLE |
| 47 | + |
| 48 | + |
| 49 | +Hi Jan, |
| 50 | + |
| 51 | +That model-discovery project is one I'd love to have. It would be so=20 |
| 52 | +useful in stimulating my structure-discovery work! |
| 53 | + |
| 54 | +It sounds as though you're satisfied that you can extract formulae and=20 |
| 55 | +other information from Excel. Does Excel now output files in Open Document= |
| 56 | +=20 |
| 57 | +Format, or do you need to convert them? |
| 58 | + |
| 59 | +Myself, I've tried various approaches to extracting formulae etc. First,=20 |
| 60 | +making Excel save them as SYLK. This is an old-fashioned Microsoft text=20 |
| 61 | +representation of a spreadsheet, intended for input to other programs such= |
| 62 | +=20 |
| 63 | +as Multiplan. It's easy to parse, but doesn't include all the information=20 |
| 64 | +about formats and other properties. |
| 65 | + |
| 66 | +Second, Excel 2003 introduced output to XML. So I've used your XML parser=20 |
| 67 | +to read these files, searching for the appropriate XML elements. This was=20 |
| 68 | +fairly easy to code, but I ran into memory problems with big spreadsheets,= |
| 69 | +=20 |
| 70 | +in particular with a 200-sheet monster forecaster for social housing=20 |
| 71 | +finances which contained 60 interconnected 20=D740 tables. (See=20 |
| 72 | +http://arxiv.org/abs/0803.0163 , "Rapid Spreadsheet Reshaping with=20 |
| 73 | +Excelsior: multiple drastic changes to content and layout are easy when=20 |
| 74 | +you represent enough structure".) |
| 75 | + |
| 76 | +Third, a different approach. I wrote a VBA macro that looped around all=20 |
| 77 | +the cells in a spreadsheet and dumped their contents to a text file as=20 |
| 78 | +Prolog facts. I could then 'consult' this into Prolog and further analyse=20 |
| 79 | +it. The file was much smaller than Excel XML files, and I made it even=20 |
| 80 | +smaller by making the VBA detect runs of identical formulae and output=20 |
| 81 | +them as only one line. The advantage of this approach was the smallness of= |
| 82 | +=20 |
| 83 | +the file being read by Prolog, and the fact that I could customise the=20 |
| 84 | +macro to dump only what I needed. The disadvantage was that one needed=20 |
| 85 | +Windows and Excel in order to run the macro. By the way, if the macro is=20 |
| 86 | +any use to you, I'd be happy to send it. |
| 87 | + |
| 88 | +Fourth, I tried Fabien Todescato's library for connecting Prolog to Excel.= |
| 89 | +=20 |
| 90 | +But I couldn't make it work consistently, and I couldn't get the expert=20 |
| 91 | +help I needed to modify the COM programming and call it via Prolog's=20 |
| 92 | +foreign-language interface. It would have taken too long for me to try=20 |
| 93 | +this on my own. You're lucky, because as a grant-holder, you can allocate=20 |
| 94 | +such tasks to a student or get help from your I.T. department... |
| 95 | + |
| 96 | +As far as parsing formulae goes, I've written two tokenisers and parsers,=20 |
| 97 | +in Prolog, which I'd also be happy to give you. Googling "excel formula=20 |
| 98 | +grammar" finds a lot of references, but I don't know which ones actually=20 |
| 99 | +work. http://homepages.ecs.vuw.ac.nz/~elvis/db/Excel.shtml ,=20 |
| 100 | +"Investigation into Excel Syntax and a Formula Grammar", looks reasonable.= |
| 101 | +=20 |
| 102 | +You need to be able to parse array formulae as well as scalar formulae,=20 |
| 103 | +and recognise error values such as #NAME. |
| 104 | + |
| 105 | +Also, you need to decide whether to treat cells that contain values such=20 |
| 106 | +as 1 or "A" differently from cells that contain constant formulae such as=20 |
| 107 | +=3D1 or =3D"A". |
| 108 | + |
| 109 | +After parsing comes the structure discovery. As you know, one thing I did=20 |
| 110 | +was to look for runs of formulae in which a subexpression changes on each=20 |
| 111 | +step by 1 in an integer or address. For example: |
| 112 | + 3*(A1+1) |
| 113 | + 3*(A2+1) |
| 114 | +or |
| 115 | + 3*(A1+1) |
| 116 | + 3*(A1+2) |
| 117 | +There are examples in several of my papers: a good one is the blog posting= |
| 118 | +=20 |
| 119 | +at http://www.j-paine.org/dobbs/structure_discovery.html , "How to Reveal=20 |
| 120 | +Implicit Structure in Spreadsheets". For example, |
| 121 | + Sheet1!E8 =3D Sheet1!C8-Sheet1!D8 |
| 122 | + Sheet1!E9 =3D Sheet1!C9-Sheet1!D9 |
| 123 | + Sheet1!E10 =3D Sheet1!C10-Sheet1!D10 |
| 124 | +become |
| 125 | + Sheet1!E[V1 in 8:10] =3D Sheet1!C[V1]-Sheet1!D[V1] |
| 126 | +where V1 is a variable that my algorithm introduces. The algorithm walks=20 |
| 127 | +over both terms in the same way that unification does. I suspect it's a=20 |
| 128 | +kind of anti-unification. |
| 129 | + |
| 130 | +Another thing is to make formulae more intelligible, rewriting them by=20 |
| 131 | +mapping addresses to elements of named arrays. For example, |
| 132 | + Sheet1!E8 =3D Sheet1!C8-Sheet1!D8 |
| 133 | +to |
| 134 | + net_profit[1] =3D income[1]-expenses[1] |
| 135 | +My software allows the user to specify mappings from cell ranges to named=20 |
| 136 | +arrays in an input file, which it will then read and use in rewriting the=20 |
| 137 | +formulae accordingly. |
| 138 | + |
| 139 | +As you pointed out, headers are useful in guessing the names to use. It's=20 |
| 140 | +fairly easy to code something that picks out the labels in a region that=20 |
| 141 | +one has identified roughly by eye. To discover how long the range=20 |
| 142 | +underneath or to the right of the label is, I've used the run detector=20 |
| 143 | +mentioned above. Combined with rewriting formulae, this can make the=20 |
| 144 | +ranges very intelligible. In the example above, |
| 145 | + Sheet1!E8 =3D Sheet1!C8-Sheet1!D8 |
| 146 | + Sheet1!E9 =3D Sheet1!C9-Sheet1!D9 |
| 147 | + Sheet1!E10 =3D Sheet1!C10-Sheet1!D10 |
| 148 | +which, without renaming, become this after run detection: |
| 149 | + Sheet1!E[V1 in 8:10] =3D Sheet1!C[V1]-Sheet1!D[V1] |
| 150 | +instead become |
| 151 | + net_profit[V0] =3D income[V0]-expenses[V0] |
| 152 | +with renaming and run detection. |
| 153 | + |
| 154 | +Here is some other work on detecting labels and structure: |
| 155 | + |
| 156 | +http://www.datadefractor.com/Portals/0/Documents/Structuring%20The%20Unstru= |
| 157 | +ctured.pdf=20 |
| 158 | +, |
| 159 | +"Structuring the Unstructured: How to Dimensionalize Semi-Structured=20 |
| 160 | +Business Data". |
| 161 | + |
| 162 | +http://arxiv.org/abs/0802.3924 , |
| 163 | +"A Toolkit for Scalable Spreadsheet Visualization", |
| 164 | +Markus Clermont. |
| 165 | + |
| 166 | +http://web.engr.orst.edu/~erwig/papers/TypeInf_PPDP06.pdf , |
| 167 | +"Type Inference for Spreadsheets", |
| 168 | +Robin Abraham and Martin Erwig. |
| 169 | + |
| 170 | +http://www.google.co.uk/url?sa=3Dt&rct=3Dj&q=3Dlabels+and+type+inference+in= |
| 171 | ++spreadsheets&source=3Dweb&cd=3D3&ved=3D0CC0QFjAC&url=3Dhttp%3A%2F%2Fcitese= |
| 172 | +erx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.153.6517%26rep%3Drep1%2= |
| 173 | +6type%3Dpdf&ei=3DhDmMUKisOujE0QXv5IDwBg&usg=3DAFQjCNEY4wUmgx3xdnFvOjmnCO9Gs= |
| 174 | +D7Baw |
| 175 | +"Software Engineering for Spreadsheets", |
| 176 | +Martin Erwig. |
| 177 | + |
| 178 | +By the way, I wonder whether these are any use to you on ecological |
| 179 | +models: |
| 180 | + |
| 181 | +http://www.era.lib.ed.ac.uk/handle/1842/4679 , |
| 182 | +"The Use of Prolog for Improving the Rigour and Accessibility of=20 |
| 183 | +Ecological Modelling", |
| 184 | +Alan Bundy, R. Muetzelfeldt, D. Robertson, M. Uschold. |
| 185 | + |
| 186 | +http://www.research.ed.ac.uk/portal/files/412346/Eco_Logic_Logic_Based_Appr= |
| 187 | +oaches_to_Ecological_Modelling.pdf=20 |
| 188 | +, |
| 189 | +"Eco-Logic: Logic-Based Approaches to Ecological Modelling", |
| 190 | +D. Robertson, A. Bundy, R. Muetzelfeldt, M. Haggith, M. Uschold. |
| 191 | + |
| 192 | +You might be interested in=20 |
| 193 | +http://www.j-paine.org/excelsior_2004/intro.html . This is an early=20 |
| 194 | +version of my structure-discovery program, to which I gave a=20 |
| 195 | +Prolog-TLI-style interface with a command language that could pass=20 |
| 196 | +spreadsheets around as values and operate on them. |
| 197 | + |
| 198 | +Cheers, |
| 199 | + |
| 200 | +Jocelyn Ireson-Paine |
| 201 | +http://www.j-paine.org |
| 202 | ++44 (0)7768 534 091 |
| 203 | + |
| 204 | +Jocelyn's Cartoons: |
| 205 | +http://www.j-paine.org/blog/jocelyns_cartoons/ |
| 206 | + |
| 207 | +On Fri, 26 Oct 2012, Jan Wielemaker wrote: |
| 208 | + |
| 209 | +> Hi Jocelyn, |
| 210 | +> |
| 211 | +> Thanks for getting back. This (sub-)project is about discovering the=20 |
| 212 | +> underlying `model' from spreadsheets as they are used in science, in |
| 213 | +> particular environmental research. What we are mostly after is what |
| 214 | +> the numbers mean. I.e., relate them to units and properties of |
| 215 | +> concepts. For example, a car (concept) produces (property) X (the number= |
| 216 | +)=20 |
| 217 | +> kg/km (unit) CO2 (concept). To do this, we need to reason |
| 218 | +> about layout, colours/fonts to find headers, link text fields to |
| 219 | +> ontologies and know what the formulas relate. For now, we won't |
| 220 | +> assume that the spreadsheet contains errors, but that may change :-) |
| 221 | +> |
| 222 | +> My job is mostly to get the infrastructure in place. I'm (as we |
| 223 | +> speak) writing a parser for ODS (the OpenDocument Spreadsheet format). That |
| 224 | +> is fairly trivial; won't take more than 1 or 2 days to get all the |
| 225 | +> relevant stuff into Prolog facts. My next step is to define some |
| 226 | +> pattern primitives and demonstrate the basics to a PhD student. |
| 227 | +> She can do the real work :-) |
| 228 | +> |
| 229 | +> Oh, the sizes vary wildly: from a few hundred to hundreds of thousands |
| 230 | +> of cells. |
| 231 | +> |
| 232 | +> I'm now working on the expression grammar (defined in=20 |
| 233 | +> http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part2.html= |
| 234 | +#__RefHeading__1017930_715980110).=20 |
| 235 | +> If I'm doing double |
| 236 | +> work, please shout :-) As said, I'm probably not very interested |
| 237 | +> in executing the functions, but if you happen to have a library of |
| 238 | +> them, it might come in handy. |
| 239 | +> |
| 240 | +> Cheers --- Jan |
| 241 | +> |
| 242 | +> |
| 243 | +> |
| 244 | +> On 10/26/2012 12:57 PM, Jocelyn Ireson-Paine wrote: |
| 245 | +>> Hi Jan, |
| 246 | +>>=20 |
| 247 | +>> Yes, I'm fine thanks. |
| 248 | +>>=20 |
| 249 | +>> I haven't entirely stopped working on structure discovery, but haven't |
| 250 | +>> done much recently because I wasn't able to get funding. If I could, I'd |
| 251 | +>> continue. There was also the difficulty of connecting SWI-Prolog to |
| 252 | +>> Excel. I do still have the Prolog code that I wrote up in my papers, but |
| 253 | +>> it isn't completely automatic (because analysis often needs a lot of |
| 254 | +>> trial and error), and the interface is crude. I don't think it would be |
| 255 | +>> easy for anyone else to use, and I'd have to explain the commands. |
| 256 | +>>=20 |
| 257 | +>> How big is the spreadsheet you need to analyse, i.e. how many sheets and |
| 258 | +>> cells per sheet? What do you need to do with it? Are you trying to |
| 259 | +>> compile its formulae into some other language? Do you need to check for |
| 260 | +>> errors, or are you reasonably certain that the spreadsheet does what |
| 261 | +>> it's meant to? |
| 262 | +>>=20 |
| 263 | +>> Cheers, |
| 264 | +>>=20 |
| 265 | +>> Jocelyn Ireson-Paine |
| 266 | +>> http://www.j-paine.org |
| 267 | +>> +44 (0)7768 534 091 |
| 268 | +>>=20 |
| 269 | +>> Jocelyn's Cartoons: |
| 270 | +>> http://www.j-paine.org/blog/jocelyns_cartoons/ |
| 271 | +>>=20 |
| 272 | +>> On Thu, 25 Oct 2012, Jan Wielemaker wrote: |
| 273 | +>>=20 |
| 274 | +>>> Hi Jocelyn, |
| 275 | +>>>=20 |
| 276 | +>>> All ok? We need to do some spreadsheet structure discovery for a |
| 277 | +>>> program, so I came across your work. It seems pretty old, so I assume |
| 278 | +>>> you moved on. I was wondering whether this software is available and |
| 279 | +>>> whether you have any recommendations on this topic. What you describe |
| 280 | +>>> follows more or less my first intuition. In Bonn I recently attende= |
| 281 | +d |
| 282 | +>>> a talk proposing CHR for this job. |
| 283 | +>>> |
| 284 | +>>> Thanks --- Jan |
| 285 | +>>>=20 |
| 286 | +> |
| 287 | +>= |
| 288 | + |
| 289 | +--1566387330-823825051-1351367894=:7371-- |
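
The run detection described in the message above (collapsing formulae whose integers or row numbers advance by 1 on each step into a single pattern with a variable) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the author's Prolog implementation: the names `tokenize` and `generalise_run` are hypothetical, the tokenizer is a crude split into digit runs and everything else rather than a real formula parser, and the range is returned separately instead of being written as `[V1 in 8:10]`.

```python
import re

# Assumption: a crude tokenizer that splits a formula into runs of digits
# and runs of everything else. A real system would parse the formula.
TOKEN = re.compile(r"[0-9]+|[^0-9]+")


def tokenize(formula):
    """Split a formula string into digit and non-digit tokens."""
    return TOKEN.findall(formula)


def generalise_run(formulas, var="V1"):
    """Collapse a run of formulas whose numbers step by 1 into one pattern.

    Returns (pattern, first, last), where first..last is the range taken by
    the variable, or None if the formulas do not form such a run.
    """
    if len(formulas) < 2:
        return None
    rows = [tokenize(f) for f in formulas]
    first = rows[0]
    if any(len(r) != len(first) for r in rows):
        return None  # different shapes cannot be generalised together

    pattern, base = [], None
    for i, tok in enumerate(first):
        column = [r[i] for r in rows]
        if all(t == tok for t in column):
            pattern.append(tok)  # identical in every formula: keep literally
        elif all(t.isdigit() for t in column) and all(
            int(t) == int(tok) + k for k, t in enumerate(column)
        ):
            if base is None:
                base = int(tok)  # first varying number anchors the variable
            offset = int(tok) - base
            pattern.append(f"[{var}]" if offset == 0 else f"[{var}{offset:+d}]")
        else:
            return None  # tokens differ in some other way: not a run
    if base is None:
        return None  # all formulas identical: nothing varies
    return "".join(pattern), base, base + len(rows) - 1


run = [
    "Sheet1!E8 = Sheet1!C8-Sheet1!D8",
    "Sheet1!E9 = Sheet1!C9-Sheet1!D9",
    "Sheet1!E10 = Sheet1!C10-Sheet1!D10",
]
print(generalise_run(run))
```

Walking all formulas position by position and replacing the places where they disagree with a variable is in the spirit of the anti-unification the message mentions. Numbers that vary in step but start from a different value are expressed as an offset from the first varying number, for example `B[V1]-A[V1-1]`.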