Commit 87896e6

Jan Wielemaker authored and committed
Added mail by Jocelyn
1 parent 5570b39 commit 87896e6

1 file changed

+289
-0
lines changed

Jocelyn.mail

+289
@@ -0,0 +1,289 @@
1+
Received: from PEXEE011B.vu.local (130.37.164.17) by PEXHB011B (130.37.236.65)
2+
with Microsoft SMTP Server (TLS) id 14.2.298.4; Sat, 27 Oct 2012 22:00:34
3+
+0200
4+
Received: from filter3-utr.mf.surf.net (195.169.124.154) by mailin.vu.nl
5+
(130.37.164.17) with Microsoft SMTP Server (TLS) id 14.2.298.4; Sat, 27 Oct
6+
2012 22:00:34 +0200
7+
Received: from balrog.mythic-beasts.com (balrog.mythic-beasts.com
8+
[93.93.130.6]) by filter3-utr.mf.surf.net (8.14.3/8.14.3/Debian-9.4) with
9+
ESMTP id q9RK0WdE002487 for <[email protected]>; Sat, 27 Oct 2012 22:00:33
10+
+0200
11+
Received: from [93.93.130.49] (helo=sphinx.mythic-beasts.com) by
12+
balrog.mythic-beasts.com with esmtp (Exim 4.69) (envelope-from
13+
<[email protected]>) id 1TSCYa-0003Pa-H7 for [email protected]; Sat, 27 Oct
14+
2012 21:00:32 +0100
15+
Received: from popx (helo=localhost) by sphinx.mythic-beasts.com with
16+
local-esmtp (Exim 4.72) (envelope-from <[email protected]>) id
17+
1TSCZp-0000qw-SY for [email protected]; Sat, 27 Oct 2012 21:01:52 +0100
18+
Date: Sat, 27 Oct 2012 21:01:49 +0100
19+
From: Jocelyn Ireson-Paine <[email protected]>
20+
X-X-Sender: [email protected]
21+
To: Jan Wielemaker <[email protected]>
22+
Subject: Re: Speardsheets
23+
In-Reply-To: <[email protected]>
24+
Message-ID: <[email protected]>
25+
26+
User-Agent: Alpine 2.02 (LRH 1266 2009-07-14)
27+
Content-Type: multipart/mixed;
28+
boundary="1566387330-823825051-1351367894=:7371"
29+
Content-ID: <[email protected]>
30+
X-Spam-Status: No, score=-0.0
31+
X-BlackCat-Spam-Score: 0
32+
X-Bayes-Prob: 0.0001 (Score 0, tokens from: @@RPTN)
33+
X-Spam-Score: -1.10 () [Tag at 6.00] SPF(none:0),CC(GB:-0.1),RBL(RPRL-ham:-1.0)
34+
X-CanIt-Geo: ip=93.93.130.6; country=GB; latitude=54.0000; longitude=-2.0000; http://maps.google.com/maps?q=54.0000,-2.0000&z=6
35+
X-CanItPRO-Stream: vu:Medium (inherits from vu:default,base:default)
36+
X-Canit-Stats-ID: 08Igk0xpb - fb8fc17e364e - 20121027 (trained as not-spam)
37+
X-Scanned-By: CanIt (www . roaringpenguin . com) on 195.169.124.154
38+
Return-Path: [email protected]
39+
X-MS-Exchange-Organization-AVStamp-Mailbox: MSFTFF;1;0;0 0 0
40+
X-MS-Exchange-Organization-AuthSource: PEXEE011B.vu.local
41+
X-MS-Exchange-Organization-AuthAs: Anonymous
42+
MIME-Version: 1.0
43+
44+
--1566387330-823825051-1351367894=:7371
45+
Content-Type: text/plain; charset="ISO-8859-15"; format=flowed
46+
Content-Transfer-Encoding: QUOTED-PRINTABLE
47+
Content-ID: <[email protected]>
48+
49+
Hi Jan,
50+
51+
That model-discovery project is one I'd love to have. It would be so=20
52+
useful in stimulating my structure-discovery work!
53+
54+
It sounds as though you're satisfied that you can extract formulae and=20
55+
other information from Excel. Does Excel now output files in Open Document=
56+
=20
57+
Format, or do you need to convert them?
58+
59+
Myself, I've tried various approaches to extracting formulae etc. First,=20
60+
making Excel save them as SYLK. This is an old-fashioned Microsoft text=20
61+
representation of a spreadsheet, intended for input to other programs such=
62+
=20
63+
as Multiplan. It's easy to parse, but doesn't include all the information=20
64+
about formats and other properties.
65+
66+
Second, Excel 2003 introduced output to XML. So I've used your XML parser=20
67+
to read these files, searching for the appropriate XML elements. This was=20
68+
fairly easy to code, but I ran into memory problems with big spreadsheets,=
69+
=20
70+
in particular with a 200-sheet monster forecaster for social housing=20
71+
finances which contained 60 interconnected 20=D740 tables. (See=20
72+
http://arxiv.org/abs/0803.0163 , "Rapid Spreadsheet Reshaping with=20
73+
Excelsior: multiple drastic changes to content and layout are easy when=20
74+
you represent enough structure".)
75+
76+
Third, a different approach. I wrote a VBA macro that looped around all=20
77+
the cells in a spreadsheet and dumped their contents to a text file as=20
78+
Prolog facts. I could then 'consult' this into Prolog and further analyse=20
79+
it. The file was much smaller than Excel XML files, and I made it even=20
80+
smaller by making the VBA detect runs of identical formulae and output=20
81+
them as only one line. The advantage of this approach was the smallness of=
82+
=20
83+
the file being read by Prolog, and the fact that I could customise the=20
84+
macro to dump only what I needed. The disadvantage was that one needed=20
85+
Windows and Excel in order to run the macro. By the way, if the macro is=20
86+
any use to you, I'd be happy to send it.
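
Editorial sketch of the third approach, in Python rather than VBA. It takes cells that have already been read from a sheet, collapses vertical runs of identical formula text into one line, and prints them as Prolog facts. The predicate name cells/5, the helper names, and the use of R1C1-style formula text (so that copied formulae compare equal) are assumptions for illustration, not the macro described above.

```python
def quote_atom(text):
    """Quote a string as a Prolog atom, doubling backslashes and single quotes."""
    return "'" + text.replace("\\", "\\\\").replace("'", "''") + "'"


def format_run(run):
    """Render one run as a hypothetical cells(Sheet, Col, FirstRow, LastRow, Formula) fact."""
    sheet, col, first, last, formula = run
    return "cells(%s, %s, %d, %d, %s)." % (
        quote_atom(sheet), quote_atom(col), first, last, quote_atom(formula))


def dump_facts(cells):
    """cells: iterable of (sheet, column, row, formula_text) tuples.

    Consecutive rows in the same column with identical formula text
    are merged into a single fact covering the whole run.
    """
    facts = []
    run = None  # [sheet, col, first_row, last_row, formula]
    for sheet, col, row, formula in sorted(cells):
        same_run = (run is not None and run[0] == sheet and run[1] == col
                    and run[3] + 1 == row and run[4] == formula)
        if same_run:
            run[3] = row              # extend the current run by one row
        else:
            if run is not None:
                facts.append(format_run(run))
            run = [sheet, col, row, row, formula]
    if run is not None:
        facts.append(format_run(run))
    return facts


example = [
    ("Sheet1", "E", 8, "RC[-2]-RC[-1]"),
    ("Sheet1", "E", 9, "RC[-2]-RC[-1]"),
    ("Sheet1", "E", 10, "RC[-2]-RC[-1]"),
    ("Sheet1", "A", 1, "1"),
]
for fact in dump_facts(example):
    print(fact)
# cells('Sheet1', 'A', 1, 1, '1').
# cells('Sheet1', 'E', 8, 10, 'RC[-2]-RC[-1]').
```

The printed lines could be written to a file and consulted into Prolog; three cells in column E take one line instead of three.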
87+
88+
Fourth, I tried Fabien Todescato's library for connecting Prolog to Excel.=
89+
=20
90+
But I couldn't make it work consistently, and I couldn't get the expert=20
91+
help I needed to modify the .COM programming and call it via Prolog's=20
92+
foreign-language interface. It would have taken too long for me to try=20
93+
this on my own. You're lucky, because as a grant-holder, you can allocate=20
94+
such tasks to a student or get help from your I.T. department...
95+
96+
As far as parsing formulae goes, I've written two tokenisers and parsers,=20
97+
in Prolog, which I'd also be happy to give you. Googling "excel formula=20
98+
grammar" finds a lot of references, but I don't know which ones actually=20
99+
work. http://homepages.ecs.vuw.ac.nz/~elvis/db/Excel.shtml ,=20
100+
"Investigation into Excel Syntax and a Formula Grammar", looks reasonable.=
101+
=20
102+
You need to be able to parse array formulae as well as scalar formulae,=20
103+
and recognise error values such as #NAME.
104+
105+
Also, to decide whether you want to treat cells that contain values such=20
106+
as 1 or "A" differently from cells that contain constant formulae such as=20
107+
=3D1 or =3D"A".
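
Editorial sketch of that distinction, in Python. It assumes the raw cell content is available as a string, that formulae start with "=" (written =3D in this quoted-printable mail), and that a "constant formula" is one whose body is a single number or string literal. The function name and category labels are hypothetical.

```python
import re

# A single numeric literal or a double-quoted string ("" escapes a quote).
LITERAL = re.compile(r'\s*(?:[+-]?\d+(?:\.\d+)?|"(?:[^"]|"")*")\s*')


def classify(content):
    """Return 'value', 'constant_formula' or 'formula' for raw cell content."""
    if not content.startswith("="):
        return "value"
    if LITERAL.fullmatch(content[1:]):
        return "constant_formula"
    return "formula"


print(classify("1"))       # value
print(classify("=1"))      # constant_formula
print(classify('="A"'))    # constant_formula
print(classify("=A1+1"))   # formula
```

Whether the two constant cases should then be merged or kept apart is the design decision raised above; the sketch only makes the cases visible.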
108+
109+
After parsing comes the structure discovery. As you know, one thing I did=20
110+
was to look for runs of formulae in which a subexpression changes on each=20
111+
step by 1 in an integer or address. For example:
112+
3*(A1+1)
113+
3*(A2+1)
114+
or
115+
3*(A1+1)
116+
3*(A1+2)
117+
There are examples in several of my papers: a good one is the blog posting=
118+
=20
119+
at http://www.j-paine.org/dobbs/structure_discovery.html , "How to Reveal=20
120+
Implicit Structure in Spreadsheets". For example,
121+
Sheet1!E8 =3D Sheet1!C8-Sheet1!D8
122+
Sheet1!E9 =3D Sheet1!C9-Sheet1!D9
123+
Sheet1!E10 =3D Sheet1!C10-Sheet1!D10
124+
become
125+
Sheet1!E[V1 in 8:10] =3D Sheet1!C[V1]-Sheet1!D[V1]
126+
where V1 is a variable that my algorithm introduces. The algorithm walks=20
127+
over both terms in the same way that unification does. I suspect it's a=20
128+
kind of anti-unification.
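
Editorial sketch of the idea, in Python. It is a toy anti-unifier over formulae represented as nested tuples, under these assumptions: each tuple starts with a functor tag, cell references are ("ref", sheet, column, row), and a variable is introduced only where integers increase by 1 from one formula to the next. The representation and all names are illustrative, not the algorithm from the papers.

```python
def generalise(terms, bindings):
    """Walk all terms in parallel and return their common pattern.

    bindings maps each run of integers, e.g. (8, 9, 10), to a variable
    name, so the same run always receives the same variable.
    """
    first = terms[0]
    if all(t == first for t in terms):
        return first                          # identical everywhere: keep as is
    if (all(isinstance(t, tuple) for t in terms)
            and len({(t[0], len(t)) for t in terms}) == 1):
        # Same functor and arity: generalise the arguments pairwise.
        return (first[0],) + tuple(
            generalise([t[i] for t in terms], bindings)
            for i in range(1, len(first)))
    if (all(isinstance(t, int) for t in terms)
            and all(b - a == 1 for a, b in zip(terms, terms[1:]))):
        key = tuple(terms)
        if key not in bindings:
            bindings[key] = "V%d" % (len(bindings) + 1)
        return bindings[key]
    raise ValueError("terms do not form a run")


def ref(sheet, col, row):
    return ("ref", sheet, col, row)


# Sheet1!E8 = Sheet1!C8-Sheet1!D8, and likewise for rows 9 and 10.
formulae = [("=", ref("Sheet1", "E", r),
             ("-", ref("Sheet1", "C", r), ref("Sheet1", "D", r)))
            for r in (8, 9, 10)]
bindings = {}
print(generalise(formulae, bindings))
# ('=', ('ref', 'Sheet1', 'E', 'V1'), ('-', ('ref', 'Sheet1', 'C', 'V1'), ('ref', 'Sheet1', 'D', 'V1')))
print(bindings)
# {(8, 9, 10): 'V1'}
```

Reusing one variable for the same run of values is what lets E[V1], C[V1] and D[V1] share V1, as in the rewritten formula above.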
129+
130+
Another thing is to make formulae more intelligible, rewriting them by=20
131+
mapping addresses to elements of named arrays. For example,
132+
Sheet1!E8 =3D Sheet1!C8-Sheet1!D8
133+
to
134+
net_profit[1] =3D income[1]-expenses[1]
135+
My software allows the user to specify mappings from cell ranges to named=20
136+
arrays in an input file, which it will then read and use in rewriting the=20
137+
formulae accordingly.
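
Editorial sketch of such a rewriting step, in Python. It assumes formulae are plain strings with fully qualified references such as Sheet1!C8, that each mapping covers a vertical range in one column, and that array indices start at 1. The mapping format and the function name are hypothetical and are not the input-file format of the software described above.

```python
import re

# Matches references of the form Sheet!ColRow, e.g. Sheet1!C8.
REF = re.compile(r"(\w+)!([A-Z]+)(\d+)")


def rename(formula, mappings):
    """Replace mapped cell addresses by elements of named arrays.

    mappings: {(sheet, column, first_row, last_row): array_name}
    """
    def replace(match):
        sheet, col, row = match.group(1), match.group(2), int(match.group(3))
        for (m_sheet, m_col, first, last), name in mappings.items():
            if (sheet, col) == (m_sheet, m_col) and first <= row <= last:
                return "%s[%d]" % (name, row - first + 1)
        return match.group(0)     # unmapped addresses are left untouched
    return REF.sub(replace, formula)


mappings = {
    ("Sheet1", "E", 8, 10): "net_profit",
    ("Sheet1", "C", 8, 10): "income",
    ("Sheet1", "D", 8, 10): "expenses",
}
print(rename("Sheet1!E8 = Sheet1!C8-Sheet1!D8", mappings))
# net_profit[1] = income[1]-expenses[1]
```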
138+
139+
As you pointed out, headers are useful in guessing the names to use. It's=20
140+
fairly easy to code something that picks out the labels in a region that=20
141+
one has identified roughly by eye. To discover how long the range=20
142+
underneath or to the right of the label is, I've used the run detector=20
143+
mentioned above. Combined with rewriting formulae, this can make the=20
144+
ranges very intelligible. In the example above,
145+
Sheet1!E8 =3D Sheet1!C8-Sheet1!D8
146+
Sheet1!E9 =3D Sheet1!C9-Sheet1!D9
147+
Sheet1!E10 =3D Sheet1!C10-Sheet1!D10
148+
which without renaming become this after run detection:
149+
Sheet1!E[V1 in 8:10] =3D Sheet1!C[V1]-Sheet1!D[V1]
150+
becomes
151+
net_profit[V0] =3D income[V0]-expenses[V0]
152+
with renaming and run detection.
153+
154+
Here is some other work on detecting labels and structure:
155+
156+
http://www.datadefractor.com/Portals/0/Documents/Structuring%20The%20Unstru=
157+
ctured.pdf=20
158+
,
159+
"Structuring the Unstructured: How to Dimensionalize Semi-Structured=20
160+
Business Data".
161+
162+
http://arxiv.org/abs/0802.3924 ,
163+
"A Toolkit for Scalable Spreadsheet Visualization",
164+
Markus Clermont.
165+
166+
http://web.engr.orst.edu/~erwig/papers/TypeInf_PPDP06.pdf ,
167+
"Type Inference for Spreadsheets",
168+
Robin Abraham and Martin Erwig.
169+
170+
http://www.google.co.uk/url?sa=3Dt&rct=3Dj&q=3Dlabels+and+type+inference+in=
171+
+spreadsheets&source=3Dweb&cd=3D3&ved=3D0CC0QFjAC&url=3Dhttp%3A%2F%2Fcitese=
172+
erx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.153.6517%26rep%3Drep1%2=
173+
6type%3Dpdf&ei=3DhDmMUKisOujE0QXv5IDwBg&usg=3DAFQjCNEY4wUmgx3xdnFvOjmnCO9Gs=
174+
D7Baw
175+
"Software Engineering for Spreadsheets",
176+
Martin Erwig.
177+
178+
By the way, I wonder whether these are any use to you on ecological
179+
models:
180+
181+
http://www.era.lib.ed.ac.uk/handle/1842/4679 ,
182+
"The Use of Prolog for Improving the Rigour and Accessibility of=20
183+
Ecological Modelling",
184+
Alan Bundy, R. Muetzelfeldt, D. Robertson, M. Uschold.
185+
186+
http://www.research.ed.ac.uk/portal/files/412346/Eco_Logic_Logic_Based_Appr=
187+
oaches_to_Ecological_Modelling.pdf=20
188+
,
189+
"Eco-Logic: Logic-Based Approaches to Ecological Modelling",
190+
D. Robertson, A. Bundy, R. Muetzelfeldt, M. Haggith, M. Uschold.
191+
192+
You might be interested in=20
193+
http://www.j-paine.org/excelsior_2004/intro.html . This is an early=20
194+
version of my structure-discovery program, to which I gave a=20
195+
Prolog-TLI-style interface with a command language that could pass=20
196+
spreadsheets around as values and operate on them.
197+
198+
Cheers,
199+
200+
Jocelyn Ireson-Paine
201+
http://www.j-paine.org
202+
+44 (0)7768 534 091
203+
204+
Jocelyn's Cartoons:
205+
http://www.j-paine.org/blog/jocelyns_cartoons/
206+
207+
On Fri, 26 Oct 2012, Jan Wielemaker wrote:
208+
209+
> Hi Jocelyn,
210+
>
211+
> Thanks for getting back. This (sub-)project is about discovering the=20
212+
> underlying `model' from spreadsheets as they are used in science, in
213+
> particular environmental research. What we are mostly after is what
214+
> the numbers mean. I.e., relate them to units and properties of
215+
> concepts. For example, a car (concept) produces (property) X (the number=
216+
)=20
217+
> Kg/Km (unit) CO2 (concept). To do this, we need to reason
218+
> about layout, colours/fonts to find headers, link text fields to
219+
> ontologies and know what the formulas relate. For now, we won't
220+
> assume that the spreadsheet contains errors, but that may change :-)
221+
>
222+
> My job is mostly to get the infrastructure in place. I'm (as we
223+
> speak) writing a parser for ODS (the Open Document Standard). That
224+
> is fairly trivial; won't take more than 1 or 2 days to get all the
225+
> relevant stuff into Prolog facts. My next step is to define some
226+
> pattern primitives and demonstrate the basics to a PhD student.
227+
> She can do the real work :-)
228+
>
229+
> Oh, the sizes vary wildly. From a few hundred to hundreds of thousands
230+
> of cells.
231+
>
232+
> I'm now working on the expression grammar (defined in=20
233+
> http://docs.oasis-open.org/office/v1.2/os/OpenDocument-v1.2-os-part2.html=
234+
#__RefHeading__1017930_715980110).=20
235+
> If I'm doing double
236+
> work, please shout :-) As said, I'm probably not very interested
237+
> in executing the functions, but if you happen to have a library of
238+
> them, it might come in handy.
239+
>
240+
> Cheers --- Jan
241+
>
242+
>
243+
>
244+
> On 10/26/2012 12:57 PM, Jocelyn Ireson-Paine wrote:
245+
>> Hi Jan,
246+
>>=20
247+
>> Yes, I'm fine thanks.
248+
>>=20
249+
>> I haven't entirely stopped working on structure discovery, but haven't
250+
>> done much recently because I wasn't able to get funding. If I could, I'd
251+
>> continue. There was also the difficulty of connecting SWI-Prolog to
252+
>> Excel. I do still have the Prolog code that I wrote up in my papers, but
253+
>> it isn't completely automatic (because analysis often needs a lot of
254+
>> trial and error), and the interface is crude. I don't think it would be
255+
>> easy for anyone else to use, and I'd have to explain the commands.
256+
>>=20
257+
>> How big is the spreadsheet you need to analyse, i.e. how many sheets and
258+
>> cells per sheet? What do you need to do with it? Are you trying to
259+
>> compile its formulae into some other language? Do you need to check for
260+
>> errors, or are you reasonably certain that the spreadsheet does what
261+
>> it's meant to?
262+
>>=20
263+
>> Cheers,
264+
>>=20
265+
>> Jocelyn Ireson-Paine
266+
>> http://www.j-paine.org
267+
>> +44 (0)7768 534 091
268+
>>=20
269+
>> Jocelyn's Cartoons:
270+
>> http://www.j-paine.org/blog/jocelyns_cartoons/
271+
>>=20
272+
>> On Thu, 25 Oct 2012, Jan Wielemaker wrote:
273+
>>=20
274+
>>> Hi Jocelyn,
275+
>>>=20
276+
>>> All ok? We need to do some spreadsheet structure discovery for a
277+
>>> program, so I came across your work. It seems pretty old, so I assume
278+
>>> you moved on. I was wondering whether this software is available and
279+
>>> whether you have any recommendations on this topic. What you describe
280+
>>> follows more or less my first intuition. In Bonn I recently attended
282+
>>> a talk proposing CHR for this job.
283+
>>>
284+
>>> Thanks --- Jan
285+
>>>=20
286+
>
287+
>=
288+
289+
--1566387330-823825051-1351367894=:7371--
