<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<title>Church & Duncan: Deep Data Science and Machine Learning Consulting</title>
<!-- Favicon -->
<link href="assets/favicon.ico" rel="shortcut icon" />
<link href="assets/apple-touch-icon.png" rel="apple-touch-icon" />
<!-- Fonts -->
<link href="https://fonts.googleapis.com/css?family=Open+Sans:400,600,700|Raleway:400,600" rel="stylesheet">
<link href="assets/vendor/fontawesome-4.6.3/css/font-awesome.min.css" rel="stylesheet" />
<!-- CSS -->
<link href="assets/vendor/animsition-4.0.2/css/animsition.min.css" rel="stylesheet" />
<link href="assets/vendor/bootstrap-3.3.6/css/bootstrap.min.css" rel="stylesheet" />
<link href="assets/vendor/swiper-3.3.1/css/swiper.min.css" rel="stylesheet" />
<!-- Main CSS-->
<link href="assets/css/style.css" rel="stylesheet" />
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css">
</head>
<body>
<!-- begin animsition -->
<div class="animsition">
<nav class="navbar navbar-default navbar-fixed-top" id="nav1">
<!--
<div class="topbar">
<div class="container">
<ul class="list-unstyled list-inline">
<li><i class="fa fa-envelope-o"></i>[email protected]</li>
<li><i class="fa fa-phone"></i> +658 456 789</li>
</ul>
</div>
</div> -->
<div class="container navbar-padding">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1" aria-expanded="false">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="current navbar-brand" href="index.html">CHURCH & DUNCAN</a>
</div>
<div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
<ul class="nav navbar-nav navbar-right">
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Services<span class="caret"></span></a>
<ul class="dropdown-menu">
<li><a href="data_strategy_advisory_consulting.html">Advisory: Data Strategy</a></li>
<li role="separator" class="divider"></li>
<li><a href="anomaly_detection_consulting_engineering.html">Anomaly & Fraud Detection</a></li>
<li role="separator" class="divider"></li>
<li><a href="customer_support_automation_consulting_engineering.html">Customer Support Automation</a></li>
<li role="separator" class="divider"></li>
<li><a href="churn_analytics_consulting_engineering.html">Churn Analytics</a></li>
<li role="separator" class="divider"></li>
<li><a href="demand_forecasting_consulting_engineering.html">Supply & Demand Forecasting</a></li>
<li role="separator" class="divider"></li>
<li><a href="credit_scoring_consulting_engineering.html">Credit Scoring</a></li>
<li role="separator" class="divider"></li>
<li><a href="personalized_marketing_consulting_engineering.html">Personalized Marketing</a></li>
<li role="separator" class="divider"></li>
<li><a href="digital_advertising_consulting_engineering.html">Digital Advertising</a></li>
<li role="separator" class="divider"></li>
<li><a href="search_recommendations_consulting_engineering.html">Search & Recommendations</a></li>
<li role="separator" class="divider"></li>
<!--<li><a href="supply_chain_optimization_consulting_engineering.html">Supply Chain Optimization</a></li>
<li role="separator" class="divider"></li>-->
<li><a href="text_categorization_consulting_engineering.html">Content Mining & Understanding</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Training<span class="caret"></span></a>
<ul class="dropdown-menu">
<li><a href="deep_learning_deep_dive_training.html">Deep Learning & Tensorflow</a></li>
<li role="separator" class="divider"></li>
<li><a href="large_scale_machine_learning_training.html">Machine Learning at Scale</a></li>
<li role="separator" class="divider"></li>
<li><a href="apache_spark_hadoop_deep_dive_training.html">Apache Spark v2.0 & Hadoop</a></li>
<li role="separator" class="divider"></li>
<li><a href="content_quality_anti_spam_anti_fraud_deep_dive_training.html">Anti-Spam/Fraud Deep Dive</a></li>
<li role="separator" class="divider"></li>
<!--
<li><a href="introduction_data_science_training.html">Introduction to Data Science</a></li>
<li role="separator" class="divider"></li> -->
<li><a href="machine_learning_theory_training.html">Machine Learning Theory</a></li>
<li role="separator" class="divider"></li>
<li><a href="opinion_mining_sentiment_analysis_deep_dive_training.html">Opinion Mining Deep Dive</a></li>
<li role="separator" class="divider"></li>
<li><a href="text_mining_deep_dive_training.html">Text Analytics at Scale</a></li>
<li role="separator" class="divider"></li>
<li><a href="recommender_systems_deep_dive_training.html">Zero to SOTA: Recommendations</a></li>
<li role="separator" class="divider"></li>
<li><a href="information_retrieval_search_engines_deep_dive_training.html">Zero to SOTA: Search Engines</a></li>
<!--<li role="separator" class="divider"></li>
<li><a href="deep_learning_deep_dive_training.html">Time Series Deep Dive</a></li>-->
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Resources<span class="caret"></span></a>
<ul class="dropdown-menu">
<!--
<li><a href="featured_projects.html">Featured Projects</a></li>
<li role="separator" class="divider"></li>-->
<li><a href="presentations_slides_talks_keynotes.html">Presentations & Talks</a></li>
<li role="separator" class="divider"></li>
<li><a href="research_publications.html">Research Papers & Patents</a></li>
</ul>
</li>
<li>
<a href="about.html">About</a>
</li>
<li>
<a href="contact.html">Contact</a>
</li>
</ul>
</div> <!-- /.collapse -->
</div> <!-- /.container -->
</nav>
<div class="intro-careers intro-pages parallax-window" data-parallax="scroll" data-image-src="assets/images/bg1.jpg">
<div class="intro-content">
<h2>Research Papers & Patents</h2>
<p>Understanding the importance of basic research, we consistently participate in academic conferences, review for journals, and do our own original research.</p>
</div> <!-- /.intro-content -->
<div id="startchange"></div> <!-- start navbar animation -->
</div> <!-- /.intro-careers -->
<div class="list-group long-page-submenu absolute-submenu long-page-submenu-init">
<a href="#books" id="books-menu" class="list-group-item books">Books</a>
<a href="#digital-advertising" id="digital-advertising-menu" class="list-group-item digital-advertising">Digital Advertising</a>
<a href="#information-retrieval" id="information-retrieval-menu" class="list-group-item information-retrieval">Information Retrieval</a>
<a href="#machine-learning" id="machine-learning-menu" class="list-group-item machine-learning">Machine Learning</a>
<a href="#text-mining" id="text-mining-menu" class="list-group-item text-mining">Text Mining</a>
<a href="#patents" id="patents-menu" class="list-group-item patents">Patents</a>
</div>
<section class="faq-main">
<div class="container">
<div class="row">
<div class="col-md-8 col-md-offset-2 research-publications">
<header>
<h2 id="books">Books</h2>
</header>
<div class="panel-group" id="faqAccordion">
<div class="panel panel-default ">
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question1">
<h4 class="panel-title research-paper">
<strong>Exploring attitude and affect in text: Theories and applications</strong> by <em>Y Qu, J Shanahan, J Wiebe</em>, The AAAI Spring Symposium, AAAI Press, 2004 <a href="https://www.amazon.com/Computing-Attitude-Affect-Text-Applications/dp/1402040261" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question1" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Human language technology systems have typically focused on the factual aspect of content analysis. Other aspects, including pragmatics, point of view, and style, have received much less attention. However, to achieve an adequate understanding of a text, these aspects cannot be ignored. In this symposium, we address computer-based analysis of point of view. Our goal is to bring together people from academia, government, and industry to explore annotation, modeling, mining, and classification of opinion, subjectivity, attitude, and affect in text, across a range of text management applications. The symposium therefore addresses a rather wide range of issues, from theoretical questions and models, through annotation standards and methods, to algorithms for recognizing, clustering, characterizing, and displaying attitudes and affect in text. Despite growing interest in this area, with papers recently published in major conferences and new corpora developed, there has never been a workshop or symposium that targets a wide audience of researchers and practitioners on these topics.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question2">
<h4 class="panel-title research-paper">
<strong>Soft computing for knowledge discovery: introducing Cartesian granule features</strong> by <em>James G. Shanahan</em>, The Springer International Series in Engineering and Computer Science, Springer, 2000 <a href="https://www.amazon.com/Soft-Computing-Knowledge-Discovery-International/dp/1461369479" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question2" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Knowledge discovery is an area of computer science that attempts to uncover interesting and useful patterns in data that permit a computer to perform a task autonomously or assist a human in performing a task more efficiently. Soft Computing for Knowledge Discovery provides a self-contained and systematic exposition of the key theory and algorithms that form the core of knowledge discovery from a soft computing perspective. It focuses on knowledge representation, machine learning, and the key methodologies that make up the fabric of soft computing - fuzzy set theory, fuzzy logic, evolutionary computing, and various theories of probability (e.g. naive Bayes and Bayesian networks, Dempster-Shafer theory, mass assignment theory, and others). In addition to describing many state-of-the-art soft computing approaches to knowledge discovery, the author introduces Cartesian granule features and their corresponding learning algorithms as an intuitive approach to knowledge discovery. This new approach embraces the synergistic spirit of soft computing and exploits uncertainty in order to achieve tractability, transparency and generalization. Parallels are drawn between this approach and other well known approaches (such as naive Bayes and decision trees) leading to equivalences under certain conditions. The approaches presented are further illustrated in a battery of both artificial and real-world problems. Knowledge discovery in real-world problems, such as object recognition in outdoor scenes, medical diagnosis and control, is described in detail. These case studies provide further examples of how to apply the presented concepts and algorithms to practical problems. The author provides web page access to an online bibliography, datasets, source codes for several algorithms described in the book, and other information. 
Soft Computing for Knowledge Discovery is for advanced undergraduates, professionals and researchers in computer science, engineering and business information systems who work or have an interest in the dynamic fields of knowledge discovery and soft computing.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question2 -->
</div><!--/panel-->
</div><!--/panel-group-->
</div> <!-- /.col-md-8 -->
<div class="col-md-8 col-md-offset-2 research-publications">
<header>
<h2 id="digital-advertising">Digital Advertising</h2>
</header>
<div class="panel-group" id="faqAccordion">
<div class="panel panel-default ">
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question3">
<h4 class="panel-title research-paper">
<strong>Econometric analysis and digital marketing: how to measure the effectiveness of an ad</strong> by <em>A Farahat, J Shanahan</em>, Proceedings of the sixth ACM international conference on Web search and data mining <a href="http://dl.acm.org/citation.cfm?id=2433502" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question3" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Over the past 18 years online advertising has grown to a $70 billion industry worldwide annually. Despite this impressive growth, online advertising faces many (and some would say traditional) challenges including how to measure the efficiency or the potential loss of sales caused by the inefficient use of advertising dollars. Consequently, it is vital to measure, maximize, and benchmark the efficiency of advertising media expenditures. This tutorial introduces the field of econometrics as a means of measuring the effectiveness of digital marketing. Econometrics is a field that extends and applies statistical methods to the analysis of economic phenomena. In that vein, econometrics goes beyond traditional statistics and explicitly recognizes the complexities of human behavior. Consider for example the impact of deep discounts on survival of restaurants. Struggling businesses are more likely to offer these deep discounts and eventually fail. A naive application of statistical techniques will overestimate the impact of deep discounts on business survival. In this case, the discounts are an endogenous variable as compared to an exogenous variable. This type of specification error highlights why we need a deeper look at the variables that go into statistical models. Econometrics addresses these and other issues in a formal and rigorous manner.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question3 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question4">
<h4 class="panel-title research-paper">
<strong>Learning to active learn</strong> by <em>JG Shanahan, N Lipka, D Van den Poel</em>, ACM SIGIR Workshop: Internet Advertising <a href="http://dl.acm.org/citation.cfm?id=2789993" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question4" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Active learning is a form of supervised machine learning in which the learning algorithm can interactively query the teacher to obtain labels for new data points. There are situations in which unlabeled data is abundant but labeling data is expensive. In such a scenario the learning algorithm can actively query the user/teacher for labels. Since the learner chooses the examples, the number of examples needed to learn a concept can often be much lower than the number required in normal supervised learning.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question4 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question5">
<h4 class="panel-title research-paper">
<strong>Digital advertising: An information scientist's perspective</strong> by <em>JG Shanahan, G Kurra</em>, Advanced Topics in Information Retrieval <a href="http://link.springer.com/chapter/10.1007/978-3-642-20946-8_9#page-1" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question5" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Digital online advertising is a form of promotion that uses the Internet and Web for the express purpose of delivering marketing messages to attract customers. Examples of online advertising include text ads that appear on search engine results pages, banner ads, in-text ads, or Rich Media ads that appear on regular web pages, portals, or applications. Over the past 15 years online advertising, a $65 billion industry worldwide in 2009, has been pivotal to the success of the Web. That being said, the field of advertising has been equally revolutionized by the Internet, Web, and more recently, by the emergence of the social web, and mobile devices. This success has arisen largely from the transformation of the advertising industry from a low-tech, human intensive, "Mad Men" way of doing work to highly optimized, quantitative, mathematical, computer- and data-centric processes that enable highly targeted, personalized, performance-based advertising. This chapter provides a clear and detailed overview of the technologies and business models that are transforming the field of online advertising primarily from statistical machine learning and information science perspectives.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question5 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question6">
<h4 class="panel-title research-paper">
<strong>Determining optimal advertisement frequency capping policy via Markov decision processes to maximize click through rates</strong> by <em>J Shanahan, D den Poel</em>, Proceedings of NIPS Workshop: Machine Learning in Online Advertising <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.222.5651&rep=rep1&type=pdf#page=45" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question6" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Digital online advertising is a form of promotion that uses the Internet and World Wide Web for the express purpose of delivering marketing messages to attract customers. Frequency capping is a term in digital advertising that means restricting (or capping) the number of times (frequency) a specific visitor to a website or group of websites (in the case of ad networks) is shown a particular advertisement. Frequency capping is a feature within ad serving that allows the advertiser/ad-network to limit the number of impressions/views of a specific ad that a visitor sees within a period of time. The advertiser or advertising network specifies a limit on the number of impressions allowed per day, per week, or per month for an individual user. Frequency capping is often viewed as a key means of preventing banner burnout (the point where visitors are being overexposed and response drops) and of maintaining a competitive quality score (a core component of expected CPM-based ranking). Generally, the frequency capping policy for an ad is set heuristically by the advertiser or by the ad network where the ad runs, and is optimized for short-term gain. In this paper we propose a data-driven and principled approach that optimizes the lifetime value of site visitors. We propose to set frequency capping policies for different online marketing segments using Markov decision processes (MDP). Managing targeted marketing (customer relationship management) campaigns in this way can lead to substantial improvement in several business metrics such as click-through rates and revenue. Though the proposed approach lacks evaluation at the time of submission, we hope to complete a study using this approach and present the results at the workshop.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question6 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question7">
<h4 class="panel-title research-paper">
<strong>Information Retrieval in Advertising</strong> by <em>Ewa Dominowska, Eugene Agichtein, Evgeniy Gabrilovich, James G Shanahan</em>, ACM SIGIR 2008 IRA Workshop Proceedings <a href="http://gabrilovich.com/publications/papers/Agichtein2008IRA.pdf" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question7" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Advertising is a multi-billion dollar industry that has become a significant component of the Web browsing experience. Online advertising systems incorporate many information retrieval techniques by combining content analysis, user interaction models, and commercial constraints. Advances in online advertising have come from integrating several core research areas: information retrieval, data mining, machine learning, and user modeling. The workshop will serve as an open forum for discussion of new ideas and current research related to information retrieval topics relevant to online advertising. The outcome will be a set of full and short papers covering a variety of topics. The short paper format will allow researchers new to the area to actively participate and explore novel themes. It will also enable researchers without access to extensive empirical data to propose ideas and experiments. We also expect the workshop to help develop a community of researchers interested in this area, and yield future collaboration and exchanges. Despite its commercial significance, advertising is a rather young field of research. This workshop will help the emerging research community better organize and develop a common perspective. The workshop will serve as a forum for researchers and industry participants to exchange latest ideas and best practices while encouraging future breakthroughs. It will also aid in fostering collaboration between industry and academia.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question7 -->
</div><!--/panel-->
</div><!--/panel-group-->
</div> <!-- /.col-md-8 -->
<div class="col-md-8 col-md-offset-2 research-publications">
<header>
<h2 id="information-retrieval">Information Retrieval</h2>
</header>
<div class="panel-group" id="faqAccordion">
<div class="panel panel-default ">
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question8">
<h4 class="panel-title research-paper">
<strong>A Comparative Study of Query-biased and Non-redundant Snippets for Structured Search on Mobile Devices</strong> by <em>NV Spirin, AS Kotov, KG Karahalios, V Mladenov, PA Izhutov</em>, Proceedings of the 25th ACM International on Conference on Information and Knowledge Management <a href="http://dl.acm.org/citation.cfm?id=2983699" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question8" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>To investigate what kind of snippets are better suited for structured search on mobile devices, we built an experimental mobile search application and conducted a task-oriented interactive user study with 36 participants. Four different versions of a search engine result page (SERP) were compared by varying the snippet type (query-biased vs. non-redundant) and the snippet length (two vs. four lines per result). We adopted a within-subjects experiment design and made each participant do four realistic search tasks using different versions of the application. During the study sessions, we collected search logs, "think-aloud" comments, and post-task surveys. Each session was finalized with an interview. We found that with non-redundant snippets the participants were able to complete the tasks faster and find more relevant results. Most participants preferred non-redundant snippets and wanted to see more information about each result on the SERP for any snippet type. Yet, the participants felt that the version with query-biased snippets was easier to use. We conclude with a set of practical design recommendations.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question8 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question9">
<h4 class="panel-title research-paper">
<strong>Optimizing search user interfaces and interactions within professional social networks</strong> by <em>NV Spirin</em>, Proceedings of the Ninth ACM International Conference on Web Search and Data Mining <a href="http://dl.acm.org/citation.cfm?id=2855092" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question9" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>To help users cope with the scale and influx of new information, professional social networks (PSNs) provide a search functionality. However, most of the search engines within PSNs today only support keyword queries and basic faceted search capabilities overlooking serendipitous network exploration and search for relationships between entities. This results in siloed information and a limited search space. My thesis is that we must redesign all major elements of a search user interface, such as input, control, and informational, to enable more effective search interactions within PSNs. I will introduce new insights and algorithms supporting the thesis.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question9 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question10">
<h4 class="panel-title research-paper">
<strong>The Impact of Technical Domain Expertise on Search Behavior and Task Outcome</strong> by <em>J Kiseleva, AM Garcia, J Kamps, N Spirin</em>, WSDM 2016 Workshop on User Understanding <a href="https://arxiv.org/abs/1512.07051" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question10" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Domain expertise is regarded as one of the key factors impacting search success: experts are known to write more effective queries, to select the right results on the result page, and to find answers satisfying their information needs. Search transaction logs play a crucial role in result ranking. Yet despite the variety in expertise levels of users, all prior interactions are treated alike, suggesting that weighting in expertise can improve the ranking for informational tasks. The main aim of this paper is to investigate the impact of high levels of technical domain expertise on both search behavior and task outcome. We conduct an online user study with searchers proficient in programming languages. We focus on Java and JavaScript, yet we believe that our study and results are applicable to other expertise-sensitive search tasks. The main findings are three-fold: First, we constructed expertise tests that effectively measure technical domain expertise and correlate well with the self-reported expertise. Second, we showed that there is a clear position bias, but technical domain experts were less affected by position bias. Third, we found that general expertise helped in finding the correct answers, but the domain experts were more successful as they managed to detect better answers. Our work uses explicit tests to determine user expertise levels, which is an important step toward fully automatic detection of expertise levels based on interaction behavior. A deeper understanding of the impact of expertise on search behavior and task outcome can enable more effective use of expert behavior in search logs, essentially making everyone search like an expert.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question11">
<h4 class="panel-title research-paper">
<strong>Relevance-aware Filtering of Tuples Sorted by an Attribute Value via Direct Optimization of Search Quality Metrics</strong> by <em>NV Spirin, M Kuznetsov, J Kiseleva, YV Spirin, PA Izhutov</em>, Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval <a href="http://dl.acm.org/citation.cfm?id=2767822" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question11" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Sorting tuples by an attribute value is a common search scenario and many search engines support such capabilities, e.g. price-based sorting in e-commerce, time-based sorting on a job or social media website. However, sorting purely by the attribute value might lead to poor user experience because the relevance is not taken into account. Hence, at the top of the list the users might see irrelevant results. In this paper we choose a different approach. Rather than just returning the entire list of results sorted by the attribute value, we additionally suggest relevance-aware (post-)filtering of the search results. Following this approach, we develop a new algorithm based on dynamic programming that directly optimizes a given search quality metric. It can be seamlessly integrated as the final step of a query processing pipeline and provides a theoretical guarantee on optimality. We conduct a comprehensive evaluation of our algorithm on synthetic data and real learning to rank data sets. Based on the experimental results, we conclude that the proposed algorithm is superior to typically used heuristics and has a clear practical value for the search and related applications.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question12">
<h4 class="panel-title research-paper">
<strong>People Search within an Online Social Network: Large Scale Analysis of Facebook Graph Search Query Logs</strong> by <em>NV Spirin, J He, M Develin, KG Karahalios, M Boucher</em>, Proceedings of the 23rd ACM International Conference on Information and Knowledge Management <a href="http://dl.acm.org/citation.cfm?id=2661967" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question12" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Popular online social networks (OSN) generate hundreds of terabytes of new data per day and connect millions of users. To help users cope with the immense scale and influx of new information, OSNs provide a search functionality. However, most of the search engines in OSNs today only support keyword queries and provide basic faceted search capabilities overlooking serendipitous network exploration and search for relationships between OSN entities. This results in siloed information and a limited search space. In 2013 Facebook introduced its innovative Graph Search product with the goal to take the OSN search experience to the next level and facilitate exploration of the Facebook Graph beyond the first degree. In this paper we explore people search on Facebook by analyzing an anonymized social graph, anonymized user profiles, and large scale anonymized query logs generated by users of Facebook Graph Search. We uncover numerous insights about people search across several demographics. We find that named entity and structured queries complement each other across one's duration on Facebook, that females search for people proportionately more than males, and that users submit more queries as they gain more friends. We introduce the concept of a lift predicate and highlight how a graph distance varies with the search goal. Based on these insights, we present a set of design implications to guide the research and development of the OSN search in the future.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question13">
<h4 class="panel-title research-paper">
<strong>Exploiting sequential relationships for familial classification</strong> by <em>Lee S Jensen, James G Shanahan</em>, Proceedings of the 19th ACM international conference on Information and knowledge management <a href="http://dl.acm.org/citation.cfm?id=1871759" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question13" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>The pervasive nature of the internet has caused a significant transformation in the field of genealogical research. This has impacted not only how research is conducted, but has also dramatically increased the number of people discovering their family history. Recent market research (Maritz Marketing 2000, Harris Interactive 2009) indicates that general interest in the United States has increased from 45% in 1996, to 60% in 2000, and 87% in 2009. Increased popularity has caused a dramatic need for improvements in algorithms related to extracting, accessing, and processing genealogical data for use in building family trees. This paper presents one approach to algorithmic improvement in the family history domain, where we infer the familial relationships of households found in human transcribed United States census data. By applying advances made in natural language processing, exploiting the sequential nature of the census, and using state of the art machine learning algorithms, we were able to decrease the error by 35% over a hand coded baseline system. The resulting system is immediately applicable to hundreds of millions of other genealogical records where families are represented, but the familial relationships are missing.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question14">
<h4 class="panel-title research-paper">
<strong>Searching for design examples with crowdsourcing</strong> by <em>N Spirin, M Eslami, J Ding, P Jain, B Bailey, K Karahalios</em>, Proceedings of the 23rd International Conference on World Wide Web <a href="http://dl.acm.org/citation.cfm?id=2577371" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question14" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Examples are very important in design, but existing tools for design example search still do not cover many cases. For instance, long tail queries containing subtle and subjective design concepts, like "calm and quiet", "elegant", "dark background with a hint of color to make it less boring", are poorly supported. This is mainly due to the inherent complexity of the task, which so far has been tackled only algorithmically using general image search techniques. We propose a powerful new approach based on crowdsourcing, which complements existing algorithmic approaches and addresses their shortcomings. Out of many explored crowdsourcing configurations we found that (1) a design need should be represented via several query images and (2) AMT crowd workers should assess a query-specific relevance of a candidate example from a pre-built design collection. To test the utility of our approach, we compared it with Google Images in a query-by-example mode. Based on feedback from expert designers, the crowd selects more relevant design examples.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question15">
<h4 class="panel-title research-paper">
<strong>Unsupervised approach to generate informative structured snippets for job search engines</strong> by <em>N Spirin, K Karahalios</em>, Proceedings of the 22nd international conference on World Wide Web <a href="http://dl.acm.org/citation.cfm?id=2487891" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question15" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Aiming to improve user experience for a job search engine, in this paper we propose an idea to switch from query-biased snippets used by most web search engines to rich structured snippets associated with the main sections of a job posting page, which are more appropriate for job search due to specific user needs and the structure of job pages. We present a very simple yet actionable approach to generate such snippets in an unsupervised way. The advantages of the proposed approach are two-fold: it doesn't require manual annotation and therefore can be easily deployed to many languages, which is a desirable property for a job search engine operating internationally; it fuses naturally with the trend towards Mobile Web where the content needs to be optimized for small screen devices and informativeness.
</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question16">
<h4 class="panel-title research-paper">
<strong>Spatial probabilistic modeling of calls to businesses</strong> by <em>R Hariharan, JM Loh, J Shanahan, K Yamada</em>, Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems <a href="http://dl.acm.org/citation.cfm?id=1869863" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question16" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Local search engines allow users to search for entities such as businesses in a particular geographic location. To improve the geographic relevance of search, user feedback data such as logged click locations are traditionally used. In this paper, we use anonymized mobile call log data as an alternate source of data and investigate its relevance to local search. Such data consists of records of anonymized mobile calls made to local businesses along with the locations of cell towers that handled the calls. We model the probability of calls made to particular categories of businesses as a function of distance, using a generalized linear model framework. We provide a detailed comparison between a click log and a mobile call log, showing its relevance to local search. We describe our probabilistic models and apply them to anonymized mobile call logs for New York City and Los Angeles restaurants.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question17">
<h4 class="panel-title research-paper">
<strong>Location disambiguation in local searches using gradient boosted decision trees</strong> by <em>RJ Agrawal, JG Shanahan</em>, Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems <a href="http://dl.acm.org/citation.cfm?id=1869811" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question17" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Local search is a specialization of the web search that allows users to submit geographically constrained queries. However, one of the challenges for local search engines is to uniquely understand and locate the geographical intent of the query. Geographical constraints (or location references) in a local search are often incomplete and thereby suffer from the referent ambiguity problem where the same location name can mean several different possibilities. For instance, just the term "Springfield" by itself can refer to 30 different cities in the USA. Previous approaches to location disambiguation have generally been hand compiled heuristic models. In this paper, we examine a data-driven, machine learning approach to location disambiguation. Essentially, we separately train a Gradient Boosted Decision Tree (GBDT) model on thousands of desktop and mobile-based local searches and compare the performance to one of our previous heuristic based location disambiguation systems (HLDS). The GBDT based approach shows promising results with statistically significant improvements over the HLDS approach. The error rate reduction is about 9% and 22% for the desktop-based and the mobile-based local searches respectively. Additionally, we examine the relative influence of various geographic and non-geographic features that help with the location disambiguation task. It is interesting to note that while the distance between the user and the intended location has been considered as an important variable, the relative influence of distance is secondary to the popularity of the location in the GBDT learnt models.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question18">
<h4 class="panel-title research-paper">
<strong>Learning a Query Parser for Local Web Search</strong> by <em>D Feng, J Shanahan, N Murray, R Zajac</em>, 2010 IEEE Fourth International Conference on Semantic Computing (ICSC) <a href="http://ieeexplore.ieee.org/abstract/document/5629096/" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question18" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Parsing unstructured local web queries is often tackled using simple syntactic rules that tend to be limited and brittle. Here we present a data-driven approach to learning a query parser for local-search (geographical) queries. The learnt model uses class-level ngram language model-based features; these ngram language models, harvested from structured query logs, insulate the model from surface-level tokens. The proposed approach is compared with a finite state model. Experiments show significant improvements for parsing geographical web queries using these learnt models.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question19">
<h4 class="panel-title research-paper">
<strong>Crowdsourcing local search relevance</strong> by <em>JF Paiement, JG Shanahan, R Zajac</em>, CrowdConf 2010 <a href="http://dl.acm.org/citation.cfm?id=2789993" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question19" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>This paper discusses why local search relevance is important, how to crowdsource local search relevance tasks, and which factors influence the precision of these tasks.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question20">
<h4 class="panel-title research-paper">
<strong>Document Souls: Joining Personalities to Documents to produce pro-active documents engaged in contextualized, independent search</strong> by <em>G Grefenstette, JG Shanahan</em>, Context-Based Information Retrieval <a href="http://ceur-ws.org/Vol-151/CIR-05_2.pdf" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question20" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>The idea behind the semantic web is that documents will contain additional markup that makes explicit the information content of unstructured media. We present here the Document Souls system which allows documents to become animate, actively searching the wider world for more information about their own contents, attaching relevant information to themselves as additional markup. A Document Soul is a set of information requests that are attached to a document as annotation. A demon within the system polls these requests and activates search agents that asynchronously respond to unsatisfied requests. To control search and relevance, collections of information requests are packaged as personalities which filter out unwanted information. In this paper, we present the structure of the Document Souls architecture and the function of personalities for performing contextualized search.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question21">
<h4 class="panel-title research-paper">
<strong>Agentized, contextualized Filters for Information management</strong> by <em>DA Evans, G Grefenstette, Y Qu, JG Shanahan, VM Sheftel</em>, Agent-Mediated Knowledge Management <a href="http://link.springer.com/chapter/10.1007/978-3-540-24612-1_16" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question21" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>When people read or write documents, they spontaneously generate new information needs: for example, to understand the text they are reading; to find additional information related to the points they are making in their drafts. Simultaneously, each Information Object (IO) (i.e., word, entity, term, concept, phrase, proposition, sentence, paragraph, section, document, collection, etc.) someone reads or writes also creates context for the other IOs in the same discourse. We present a conceptual model of Agentized, Contextualized Filters (ACFs) - agents that identify an appropriate context for an information object and then actively fetch and filter relevant information concerning the information object in other information sources the user has access to. We illustrate the use of ACFs in a prototype knowledge management system called ViviDocs.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question22">
<h4 class="panel-title research-paper">
<strong>CLARIT Experiments in Batch Filtering: Term Selection and Threshold Optimization in IR and SVM Filters</strong> by <em>DA Evans, JG Shanahan, N Roma, J Bennett, V Sheftel, E Stoica</em>, TREC 2003 <a href="http://trec.nist.gov/pubs/trec11/papers/clairvoyance.evans.pdf" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question22" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>The Clairvoyance team participated in the Filtering Track, submitting two runs in the Batch Filtering category. While we have been exploring the question of both topic modeling and ensemble filter construction (as in our previous TREC filtering experiments [5]), we had one distinct objective this year, to explore the viability of monolithic filters in classification-like tasks. This is appropriate to our work, in part, because monolithic filters are a crucial starting point for ensemble filtering, and it is possible for them to contribute substantially in the ensemble approach. Our primary goal in experiments this year, thus, was to explore two issues in monolithic filter construction: (1) term count selection and (2) filter threshold optimization.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question23">
<h4 class="panel-title research-paper">
<strong>A soft computing approach to road classification</strong> by <em>J Shanahan, B Thomas, M Mirmehdi, T Martin, N Campbell, J Baldwin</em>, Journal of Intelligent &amp; Robotic Systems <a href="http://link.springer.com/article/10.1023%2FA%3A1008158907779?LI=true" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question23" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Current learning approaches to computer vision have mainly focussed on low-level image processing and object recognition, while tending to ignore high-level processing such as understanding. Here we propose an approach to object recognition that facilitates the transition from recognition to understanding. The proposed approach embraces the synergistic spirit of soft computing, exploiting the global search powers of genetic programming to determine fuzzy probabilistic models. It begins by segmenting the images into regions using standard image processing approaches, which are subsequently classified using a discovered fuzzy Cartesian granule feature classifier. Understanding is made possible through the transparent and succinct nature of the discovered models. The recognition of roads in images is taken as an illustrative problem in the vision domain. The discovered fuzzy models, while providing high levels of accuracy (97%), also provide understanding of the problem domain through the transparency of the learnt models. The learning step in the proposed approach is compared with other techniques such as decision trees, naive Bayes and neural networks using a variety of performance criteria such as accuracy, understandability and efficiency.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question24">
<h4 class="panel-title research-paper">
<strong>Perceptual organization for inferring object boundaries in an image</strong> by <em>Anca L Ralescu, James G Shanahan</em>, Pattern Recognition <a href="http://www.sciencedirect.com/science/article/pii/S003132039900014X" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question24" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>We are concerned with object recognition in the framework of perceptual organization. The approach presented incorporates a number of concepts from human visual analysis, especially the Gestalt laws of organization. Fuzzy techniques are used for the definition and evaluation of the grouping/non-grouping properties as well as for the construction of structures from grouped input tokens. This method takes as input the initially fitted line segments (tokens) and then recursively groups these tokens into higher level structures (tokens) such as lines, u-structures, quadrilaterals, etc. The output high-level structures can then be used to compare with object models and thus lead to object recognition. In this paper inference (grouping) of line segments, line symmetry, junctions, closed regions and strands is presented. The approach is supported by experimental results on 2D images of an office scene environment.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question25">
<h4 class="panel-title research-paper">
<strong>Road Recognition Using Fuzzy Classifiers</strong> by <em>JG Shanahan, BT Thomas, M Mirmehdi, TP Martin, JF Baldwin</em>, BMVC <a href="http://www.xrce.xerox.com/content/download/6808/51860/file/bmvc99.pdf" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question25" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Current learning approaches to computer vision have mainly focussed on low-level image processing and object recognition, while tending to ignore higher level processing for understanding. We propose an approach to scene analysis that facilitates the transition from recognition to understanding. It begins by segmenting the image into regions using standard approaches, which are then classified using a discovered fuzzy Cartesian granule feature classifier. Understanding is made possible through the transparent and succinct nature of the discovered models. The recognition of roads in images is taken as an illustrative problem. The discovered fuzzy models, while providing high levels of accuracy (97%), also provide understanding of the problem domain through the transparency of the learnt models. The learning step in the proposed approach is compared with other techniques such as decision trees, naive Bayes and neural networks using a variety of performance criteria such as accuracy, understandability and efficiency.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question26">
<h4 class="panel-title research-paper">
<strong>Controlling with words using automatically identified fuzzy Cartesian granule feature models</strong> by <em>JF Baldwin, TP Martin, JG Shanahan</em>, International journal of approximate reasoning <a href="http://www.sciencedirect.com/science/article/pii/S0888613X9900016X" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question26" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>We present a new approach to representing and acquiring controllers based upon Cartesian granule features - multidimensional features formed over the cross product of words drawn from the linguistic partitions of the constituent input features - incorporated into additive models. Controllers expressed in terms of Cartesian granule features enable the paradigm "controlling with words" by translating process data into words that are subsequently used to interrogate a rule base, which ultimately results in a control action. The system identification of good, parsimonious additive Cartesian granule feature models is an exponential search problem. In this paper we present the GDACG constructive induction algorithm as a means of automatically identifying additive Cartesian granule feature models from example data. GDACG combines the powerful optimisation capabilities of genetic programming with a novel and cheap fitness function, which relies on the semantic separation of concepts expressed in terms of Cartesian granule fuzzy sets, in identifying these additive models. We illustrate the approach on a variety of problems including the modelling of a dynamical process and a chemical plant controller.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question27">
<h4 class="panel-title research-paper">
<strong>Knowledge discovery using cartesian granule features with applications</strong> by <em>JG Shanahan, JF Baldwin, TP Martin</em>, 18th International Conference of the North American Fuzzy Information Processing Society <a href="http://ieeexplore.ieee.org/abstract/document/781688/" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question27" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Current approaches to knowledge discovery can be differentiated based on the discovered models using the following criteria: effectiveness, understandability (to a user or expert in the domain) and evolvability (the ability to adapt over time to a changing environment). Most current approaches satisfy understandability or effectiveness, but not both simultaneously, while tending to ignore knowledge evolution. We show how knowledge representation based upon Cartesian granule features and a corresponding induction algorithm can effectively address these knowledge discovery criteria (in this paper, the discussion is limited to understandability and effectiveness) across a wide variety of problem domains, including control, image understanding and medical diagnosis.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question28">
<h4 class="panel-title research-paper">
<strong>Automatic fuzzy Cartesian granule feature discovery using genetic programming in image understanding</strong> by <em>JF Baldwin, TP Martin, JG Shanahan</em>, The 1998 IEEE International Conference on Fuzzy Systems Proceeding, IEEE World Congress on Computational Intelligence <a href="http://ieeexplore.ieee.org/abstract/document/686364/" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question28" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Variables defined over Cartesian granule feature universes can be viewed as multidimensional linguistic variables. These variable universes are formed over the cross product of words drawn from the fuzzy partitions of the constituent base features. Here we present a constructive induction algorithm, which identifies not only the Cartesian granule feature model but also the concepts/variables in which the model is expressed. The presented constructive induction algorithm combines the genetic programming search paradigm with a rather novel and cheap fitness function, which is based upon semantic discrimination analysis. Parsimony is promoted in this model discovery process, thereby leading to models with better generalisation power and transparency. The approach is demonstrated on an image understanding problem, an area that has traditionally been dominated by quantitative and black box modelling techniques. Overall, the discovered Cartesian granule feature models, when demonstrated on a large test set of outdoor images, provide highly accurate image interpretation, using four input features, with over 78% of the image area labelled correctly.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question29">
<h4 class="panel-title research-paper">
<strong>Learning perceptual organization for straight line segments</strong> by <em>AL Ralescu, JG Shanahan</em>, IEEE International Conference on Systems, Man and Cybernetics, Intelligent Systems for the 21st Century <a href="http://ieeexplore.ieee.org/abstract/document/538346/" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question29" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>The authors address the problem of structure inference in an image, in the framework of perceptual organization. This paper describes work in progress which builds on a subset of the authors' previous work on fuzzy perceptual grouping. More precisely, the authors are concerned with obtaining a fuzzy system which can achieve grouping of line segments. The data are obtained from images and consist of the results of edge extraction to which a line segment fitting algorithm has been applied. For each collection of similar and collinear segments used as input, a representative segment is used to summarize this collection. In the training stage both the input collection and the output segment can be either indicated by a human user, or obtained by overlaying the segment and real images.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question30">
<h4 class="panel-title research-paper">
<strong>Large Scale Distributed Data Science Using Apache Spark</strong> by <em>James G Shanahan, Laing Dai</em>, Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining <a href="http://dl.acm.org/citation.cfm?id=2789993" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question30" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Apache Spark is an open-source cluster computing framework for big data processing. It has emerged as the next-generation big data processing engine, overtaking Hadoop MapReduce, which helped ignite the big data revolution. Spark maintains MapReduce's linear scalability and fault tolerance, but extends it in a few important ways: it is much faster (100 times faster for certain applications); it is much easier to program, thanks to its rich APIs in Python, Java, Scala (and shortly R) and its core data abstraction, the distributed data frame; and it goes far beyond batch applications to support a variety of compute-intensive tasks, including interactive queries, streaming, machine learning, and graph processing. This tutorial will provide an accessible introduction to Spark and its potential to revolutionize academic and commercial data science practices.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
</div><!--/panel-->
</div><!--/panel-group-->
</div> <!-- /.col-md-8 -->
<div class="col-md-8 col-md-offset-2 research-publications">
<header>
<h2 id="machine-learning">Machine Learning</h2>
</header>
<div class="panel-group" id="faqAccordion">
<div class="panel panel-default ">
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question31">
<h4 class="panel-title research-paper">
<strong>Large Scale Distributed Data Science Using Apache Spark</strong> by <em>James G Shanahan, Laing Dai</em>, Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining <a href="http://dl.acm.org/citation.cfm?id=2789993" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question31" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Apache Spark is an open-source cluster computing framework for big data processing. It has emerged as the next-generation big data processing engine, overtaking Hadoop MapReduce, which helped ignite the big data revolution. Spark maintains MapReduce's linear scalability and fault tolerance, but extends it in a few important ways: it is much faster (100 times faster for certain applications); it is much easier to program, thanks to its rich APIs in Python, Java, Scala (and shortly R) and its core data abstraction, the distributed data frame; and it goes far beyond batch applications to support a variety of compute-intensive tasks, including interactive queries, streaming, machine learning, and graph processing. This tutorial will provide an accessible introduction to Spark and its potential to revolutionize academic and commercial data science practices.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question32">
<h4 class="panel-title research-paper">
<strong>Survey on web spam detection: principles and algorithms</strong> by <em>N Spirin, J Han</em>, ACM SIGKDD Explorations Magazine <a href="http://dl.acm.org/citation.cfm?id=2207252" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question32" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Search engines have become the de facto starting point for information acquisition on the Web. However, due to the web spam phenomenon, search results are not always as good as desired. Moreover, spam evolves, which makes the problem of providing high-quality search even more challenging. Over the last decade, research on adversarial information retrieval has gained a lot of interest from both academia and industry. In this paper we present a systematic review of web spam detection techniques with a focus on algorithms and underlying principles. We categorize all existing algorithms into three categories based on the type of information they use: content-based methods, link-based methods, and methods based on non-traditional data such as user behaviour, clicks, and HTTP sessions. In turn, we subcategorize the link-based category into five groups based on the ideas and principles used: label propagation, link pruning and reweighting, label refinement, graph regularization, and feature-based methods. We also define the concept of web spam numerically and provide a brief survey of various spam forms. Finally, we summarize the observations and underlying principles applied for web spam detection.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question33">
<h4 class="panel-title research-paper">
<strong>Learning to rank with nonlinear monotonic ensemble</strong> by <em>N Spirin, K Vorontsov</em>, International Workshop on Multiple Classifier Systems <a href="http://link.springer.com/chapter/10.1007/978-3-642-21557-5_4" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question33" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Over the last decade, learning to rank (L2R) has gained a lot of attention and many algorithms have been proposed. One of the most successful approaches is to build an algorithm following the ensemble principle. Boosting is the key representative of this approach. However, even boosting is not effective when used to increase the performance of individually strong algorithms, a scenario that arises when we want to blend already successful L2R algorithms in order to gain an additional benefit. To address this problem we propose a novel algorithm, based on a theory of nonlinear monotonic ensembles, which is able to blend strong base rankers effectively. Specifically, we provide the concept of defect of a set of algorithms, which allows us to deduce a popular pairwise approach in strict mathematical terms. Using the concept of defect, we formulate an optimization problem and propose a sound method for its solution. Finally, we conduct experiments with real data which show the effectiveness of the proposed approach.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question34">
<h4 class="panel-title research-paper">
<strong>Proceedings ACM 17th Conference on Information and Knowledge Management</strong> by <em>JG Shanahan, S Amer-Yahia, Y Zhang, A Kolcz, A Chowdury, D Kelly</em>, ACM CIKM 2008 <a href="http://dl.acm.org/citation.cfm?id=2789993" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question34" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>The Conference on Information and Knowledge Management (CIKM) provides an international forum for presentation and discussion of research on information and knowledge management, as well as recent advances on data and knowledge bases. The purpose of the conference is to identify challenging problems facing the development of future knowledge and information systems, and to shape future directions of research by soliciting and reviewing high quality, applied and theoretical research findings. An important part of the conference is the Workshops program which focuses on timely research challenges and initiatives.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question35">
<h4 class="panel-title research-paper">
<strong>Probabilistic workflow mining</strong> by <em>R Silva, J Zhang, JG Shanahan</em>, Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining <a href="http://dl.acm.org/citation.cfm?id=1081903" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question35" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>In several organizations, it has become increasingly popular to document and log the steps that make up a typical business process. In some situations, a normative workflow model of such processes is developed, and it becomes important to know if such a model is actually being followed by analyzing the available activity logs. In other scenarios, no model is available and, with the purpose of evaluating cases or creating new production policies, one is interested in learning a workflow representation of such activities. In either case, machine learning tools that can mine workflow models are of great interest and still relatively unexplored. We present here a probabilistic workflow model and a corresponding learning algorithm that runs in polynomial time. We illustrate the algorithm on example data derived from a real world workflow.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question36">
<h4 class="panel-title research-paper">
<strong>The 2004 AAAI Spring Symposium Series</strong> by <em>Lola Canamero, Zachary Dodds, Lloyd Greenwald, James Gunderson, Ayanna Howard, Eva Hudlicka, Cheryl Martin, Lynn Parker, Tim Oates, Terry Payne, Yan Qu, Craig Schlenoff, James G Shanahan, Sheila Tejada, Jerry Weinberg, Janyce Wiebe</em>, AI Magazine <a href="https://www.aaai.org/ojs/index.php/aimagazine/article/view/1788" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question36" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>The Association for the Advancement of Artificial Intelligence, in cooperation with Stanford University's Department of Computer Science, presented the 2004 Spring Symposium Series, Monday through Wednesday, March 22-24, at Stanford University. The titles of the eight symposia were (1) Accessible Hands-on Artificial Intelligence and Robotics Education; (2) Architectures for Modeling Emotion: Cross-Disciplinary Foundations; (3) Bridging the Multiagent and Multirobotic Research Gap; (4) Exploring Attitude and Affect in Text: Theories and Applications; (5) Interaction between Humans and Autonomous Systems over Extended Operation; (6) Knowledge Representation and Ontologies for Autonomous Systems; (7) Language Learning: An Interdisciplinary Perspective; and (8) Semantic Web Services. Each symposium had limited attendance. Most symposia chairs elected to create AAAI technical reports of their symposium, which are available as paperbound reports or (for AAAI members) are downloadable on the AAAI members-only Web site. This report includes summaries of the eight symposia, written by the symposia chairs.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question37">
<h4 class="panel-title research-paper">
<strong>Knowledge discovery with words using Cartesian granule features: an analysis for classification problems</strong> by <em>JG Shanahan</em>, Data mining, rough sets and granular computing <a href="http://link.springer.com/chapter/10.1007/978-3-7908-1791-1_3#page-1" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question37" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Cartesian granule features were originally introduced to address some of the shortcomings of existing forms of knowledge representation, such as decomposition error and transparency, and also to enable the paradigm of modelling with words through related learning algorithms. This chapter presents a detailed analysis of the impact of granularity on Cartesian granule feature models that are learned from example data in the context of classification problems. This analysis provides insights on how to effectively model problems with Cartesian granule features using various levels of granulation, granule characterizations, granule dimensionalities and granule generation techniques. Other modelling with words approaches such as the data browser [1, 2] and fuzzy probabilistic decision trees [3] are also examined and compared. In addition, this chapter provides a useful platform for understanding many other learning algorithms that may or may not explicitly manipulate fuzzy events. For example, it is shown how a naive Bayes classifier is equivalent to crisp Cartesian granule feature classifiers under certain conditions.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
</div><!--/panel-->
</div><!--/panel-group-->
</div> <!-- /.col-md-8 -->
<div class="col-md-8 col-md-offset-2 research-publications">
<header>
<h2 id="text-mining">Text Mining</h2>
</header>
<div class="panel-group" id="faqAccordion">
<div class="panel panel-default ">
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question38">
<h4 class="panel-title research-paper">
<strong>Estimating the Expected Effectiveness of Text Classification Solutions under Subclass Distribution Shifts</strong> by <em>N Lipka, B Stein, JG Shanahan</em>, 2012 IEEE 12th International Conference on Data Mining (ICDM) <a href="http://ieeexplore.ieee.org/abstract/document/6413823/" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question38" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Automated text classification is one of the most important learning technologies to fight information overload. However, the information society is not only confronted with an information flood but also with an increase in "information volatility", by which we understand the fact that the kind and distribution of a data source's emissions can vary significantly. In this paper we show how to estimate the expected effectiveness of a classification solution when the underlying data source undergoes a shift in the distribution of its subclasses (modes). Subclass distribution shifts are observed, among other places, in online media such as tweets, blogs, or news articles, where document emissions follow topic popularity. To estimate the expected effectiveness of a classification solution we partition a test sample by means of clustering. Then, using repetitive resampling with different margin distributions over the clustering, the effectiveness characteristics are studied. We show that the effectiveness is normally distributed and introduce a probabilistic lower bound that is used for model selection. We analyze the relation between our notion of expected effectiveness and the mean effectiveness over the clustering both theoretically and on standard text corpora. An important result is a heuristic for expected effectiveness estimation that is solely based on the initial test sample and that can be computed without resampling.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question39">
<h4 class="panel-title research-paper">
<strong>Computing attitude and affect in text: theory and applications</strong> by <em>JG Shanahan, Y Qu, J Wiebe</em>, The Information Retrieval Series, Springer <a href="http://link.springer.com/book/10.1007/1-4020-4102-0" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question39" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Human Language Technology (HLT) and Natural Language Processing (NLP) systems have typically focused on the "factual" aspect of content analysis. Other aspects, including pragmatics, opinion, and style, have received much less attention. However, to achieve an adequate understanding of a text, these aspects cannot be ignored. The chapters in this book address the aspect of subjective opinion, which includes identifying different points of view, identifying different emotive dimensions, and classifying text by opinion. Various conceptual models and computational methods are presented. The models explored in this book include the following: distinguishing attitudes from simple factual assertions; distinguishing the author's reports from reports of other people's opinions; and distinguishing between explicitly and implicitly stated attitudes. In addition, many applications are described that promise to benefit from the ability to understand attitudes and affect, including indexing and retrieval of documents by opinion; automatic question answering about opinions; analysis of sentiment in the media and in discussion groups about consumer products, political issues, etc.; brand and reputation management; discovering and predicting consumer and voting trends; analyzing client discourse in therapy and counseling; determining relations between scientific texts by finding reasons for citations; generating more appropriate texts and making agents more believable; and creating writers' aids. The studies reported here are carried out on different languages such as English, French, Japanese, and Portuguese. Difficult challenges remain, however. It can be argued that analyzing attitude and affect in text is an "NLP"-complete problem. The interpretation of attitude and affect depends on audience, context, and world knowledge. In addition, there is much yet to learn about the psychological and biological relationships between emotion and language.
To continue to progress in this area in NLP, more comprehensive theories of emotion, attitude and opinion are needed, as are lexicons of affective terms and knowledge of how such terms are used in context, and annotated corpora for training and evaluation. This book is a first foray into this area; it grew out of a symposium on this topic that took place at Stanford University in March 2004, with support from the American Association for Artificial Intelligence (AAAI). Several of the presentations were extended into the chapters that appear here. The chapters in this collection reflect the major themes of the workshop, corresponding to a balance among conceptual models, computational methods, and applications. The chapters in this book are organized along these themes into three broad, overlapping parts.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question40">
<h4 class="panel-title research-paper">
<strong>Validating the Coverage of Lexical Resources for Affect Analysis and Automatically Classifying New Words along Semantic Axes</strong> by <em>G Grefenstette, Y Qu, DA Evans, JG Shanahan</em>, The Information Retrieval Series, Springer <a href="http://link.springer.com/chapter/10.1007/1-4020-4102-0_9" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question40" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>In addition to factual content, many texts contain an emotional dimension. This emotive, or affect, dimension has not received a great amount of attention in computational linguistics until recently. However, now that messages (including spam) have become more prevalent than edited texts (such as newswire), recognizing this emotive dimension of written text is becoming more important. One resource needed for identifying affect in text is a lexicon of words with emotion-conveying potential. Starting from an existing affect lexicon and lexical patterns that invoke affect, we gathered a large quantity of text to measure the coverage of our existing lexicon. This chapter reports on our methods for identifying new candidate affect words and on our evaluation of our current affect lexicons. We describe how our affect lexicon can be extended based on results from these experiments.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question41">
<h4 class="panel-title research-paper">
<strong>Future short term goals of research in computational analysis of stylistics in text</strong> by <em>S Argamon, J Karlgren, JG Shanahan</em>, ACM SIGIR Forum <a href="http://dl.acm.org/citation.cfm?id=1113347" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question41" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>The first workshop on stylistic analysis of text for information access was held on the day following the 2005 SIGIR conference. This workshop addressed the automatic analysis and extraction of stylistic aspects of natural language texts. Style, roughly defined as the 'manner' in which something is expressed, as opposed to the 'content' of a message, is usually disregarded by information access applications as having no bearing on the target notion of relevance: systems have typically focused on the "factual" aspect of content analysis.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question42">
<h4 class="panel-title research-paper">
<strong>Stylistic analysis of text for information access</strong> by <em>S Argamon, J Karlgren, JG Shanahan</em>, Swedish Institute of Computer Science <a href="http://soda.swedish-ict.se/2373/1/SICS-T--2005-14--SE.pdf" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question42" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Papers from the workshop held in conjunction with the 28th Annual International ACM Conference on Research and Development in Information Retrieval, August 13-19, 2005, Salvador, Bahia, Brazil</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question43">
<h4 class="panel-title research-paper">
<strong>Coupling niche browsers and affect analysis for an opinion mining application</strong> by <em>G Grefenstette, Y Qu, JG Shanahan, DA Evans</em>, Coupling approaches, coupling media and coupling languages for information retrieval <a href="http://dl.acm.org/citation.cfm?id=2816290" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question43" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Newspapers generally attempt to present the news objectively. But textual affect analysis shows that many words carry positive or negative emotional charge. In this article, we show that coupling niche browsing technology and affect analysis technology allows us to create a new application that measures the slant in opinion given to public figures in the popular press.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question44">
<h4 class="panel-title research-paper">
<strong>Modelling with Words: Learning, fusion, and reasoning within a formal linguistic representation framework</strong> by <em>J Lawry, JG Shanahan, AL Ralescu</em>, Springer <a href="http://www.springer.com/us/book/9783540204879" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question44" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Modelling with Words is an emerging modelling methodology closely related to the paradigm of Computing with Words introduced by Lotfi Zadeh. This book is an authoritative collection of key contributions to the new concept of Modelling with Words. A wide range of issues in systems modelling and analysis is presented, extending from conceptual graphs and fuzzy quantifiers to humanist computing and self-organizing maps. Among the core issues investigated are: balancing predictive accuracy and high level transparency in learning, scaling linguistic algorithms to high-dimensional data problems, integrating linguistic expert knowledge with knowledge derived from data, identifying sound and useful inference rules, and integrating fuzzy and probabilistic uncertainty in data modelling.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question45">
<h4 class="panel-title research-paper">
<strong>TREC 2004 HARD Track Experiments in Clustering</strong> by <em>DA Evans, J Bennett, J Montgomery, V Sheftel, DA Hull, JG Shanahan</em>, TREC 2004 <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.62.3498&amp;rep=rep1&amp;type=pdf" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question45" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>The Clairvoyance team participated in the High Accuracy Retrieval from Documents (HARD) Track of TREC 2004, submitting three runs. The principal hypothesis we have been pursuing is that small numbers of documents in clusters can provide a better basis for relevance feedback than ranked lists or, alternatively, than top-N pseudo-relevance feedback (PRF). Clustering of a query response can yield one or more groups of documents in which there are "significant" numbers (greater than 30%) of relevant documents; we expect the best results when such groups are selected for feedback. Following up on work we began in our TREC-2003 HARD-Track experiments [Shanahan et al. 2004], therefore, we continued to explore approaches to clustering query response sets to concentrate relevant documents, with the goal of (a) providing users (assessors) with better sets of documents to judge and (b) making the choices among sets easier to evaluate. Our experiments, thus, focused primarily on exploiting assessor feedback through clarification forms for query expansion and largely ignored other features of the documents or metadata.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question46">
<h4 class="panel-title research-paper">
<strong>Mining multilingual opinions through classification and translation</strong> by <em>JG Shanahan, G Grefenstette, Y Qu, DA Evans</em>, Proceedings of AAAI Spring Symposium on Exploring Attitude and Affect in Text <a href="https://www.aaai.org/Papers/Symposia/Spring/2004/SS-04-07/SS04-07-027.pdf" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question46" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Today, much product feedback is provided by customers and critics online through websites, discussion boards, mailing lists, and blogs. People trying to make strategic decisions (e.g., a product launch, a purchase) will find that a web search will return many useful but heterogeneous and, increasingly, multilingual opinions on a product. Generally, the user will find it very difficult and time consuming to assimilate all available information and make an informed decision. To date, most work in automating this process has focused on monolingual texts and users. This extended abstract describes our preliminary work on mining product ratings in a multilingual setting. The proposed approaches are automatic, using a combination of techniques from classification and translation, thereby alleviating human-intensive construction and maintenance of linguistic resources.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question47">
<h4 class="panel-title research-paper">
<strong>Boosting support vector machines for text classification through parameter-free threshold relaxation</strong> by <em>JG Shanahan, N Roma</em>, Proceedings of the 12th International Conference on Information and Knowledge Management <a href="http://dl.acm.org/citation.cfm?id=956911" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question47" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Support vector machine (SVM) learning algorithms focus on finding the hyperplane that maximizes the margin (the distance from the separating hyperplane to the nearest examples) since this criterion provides a good upper bound of the generalization error. When applied to text classification, these learning algorithms lead to SVMs with excellent precision but poor recall. Various relaxation approaches have been proposed to counter this problem including: asymmetric SVM learning algorithms (soft SVMs with asymmetric misclassification costs); uneven margin based learning; and thresholding. A review of these approaches is presented here. In addition, in this paper, we describe a new threshold relaxation algorithm. This approach builds on previous thresholding work based upon the beta-gamma algorithm. The proposed thresholding strategy is parameter free, relying on a process of retrofitting and cross validation to set algorithm parameters empirically, whereas our previous approach required the specification of two parameters (beta and gamma). The proposed approach is more efficient, does not require the specification of any parameters, and similarly to the parameter-based approach, boosts the performance of baseline SVMs by at least 20% for standard information retrieval measures.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question48">
<h4 class="panel-title research-paper">
<strong>Improving SVM text classification performance through threshold adjustment</strong> by <em>JG Shanahan, N Roma</em>, Proceedings of the European Conference on Machine Learning <a href="http://link.springer.com/chapter/10.1007/978-3-540-39857-8_33#page-1" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question48" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>In general, support vector machines (SVMs), when applied to text classification, provide excellent precision but poor recall. One means of customizing SVMs to improve recall is to adjust the threshold associated with an SVM. We describe an automatic process for adjusting the thresholds of generic SVMs which incorporates a user utility model, an integral part of an information management system. By using thresholds based on utility models and the ranking properties of classifiers, it is possible to overcome the precision bias of SVMs and ensure robust performance in recall across a wide variety of topics, even when training data are sparse. Evaluations on TREC data show that our proposed threshold adjusting algorithm boosts the performance of baseline SVMs by at least 20% for standard information retrieval measures.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question49">
<h4 class="panel-title research-paper">
<strong>Clairvoyance Corporation Experiments in the TREC 2003 High Accuracy Retrieval from Documents (HARD) Track</strong> by <em>JG Shanahan, J Bennett, DA Evans, DA Hull, J Montgomery</em>, TREC 2003 <a href="http://trec.nist.gov/pubs/trec12/papers/clairvoyance.hard.pdf" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question49" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>The Clairvoyance team participated in the HARD Track, submitting fifteen runs. Our experiments focused primarily on exploiting user feedback through clarification forms for query expansion. We made limited use of the genre and related text metadata. Within the clarification form feedback framework, we explored the cluster hypothesis in the context of relevance feedback. The cluster hypothesis states that closely associated documents tend to be relevant to the same requests [Van Rijsbergen, 1979]. With this in mind, we investigated the performance impact of exploiting user feedback on groups of documents (i.e., organizing the top retrieved documents for a query into intuitive groups through agglomerative clustering or document-centric clustering), as an alternative to a ranked list of titles. This forms the basis for a new blind feedback mechanism (used to expand queries) based upon clusters of documents, as an alternative to the commonly used approach of blind feedback based upon the top N ranked documents.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question50">
<h4 class="panel-title research-paper">
<strong>Topic structure modeling</strong> by <em>DA Evans, JG Shanahan, V Sheftel</em>, Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval <a href="http://dl.acm.org/citation.cfm?id=564472" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question50" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>In this paper, we present a method based on document probes to quantify and diagnose topic structure, distinguishing topics as monolithic, structured, or diffuse. The method also yields a structure analysis that can be used directly to optimize filter (classifier) creation. Preliminary results illustrate the predictive value of the approach on TREC/Reuters-96 topics.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question51">
<h4 class="panel-title research-paper">
<strong>Modeling with words: an approach to text categorization</strong> by <em>James G Shanahan</em>, The 10th IEEE International Conference on Fuzzy Systems <a href="http://ieeexplore.ieee.org/abstract/document/1007246/" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question51" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>Traditionally, fuzzy set-based approaches have performed excellently in modeling small to medium scale problem domains. This paper examines the scalability of fuzzy systems to a large-scale problem that is inherently vague: text categorization. The paper presents two fuzzy probabilistic approaches to text classification and the corresponding machine learning algorithms to learn such systems from example data. The first approach follows the traditional fuzzy set paradigm, while the second approach fits within the modeling with words paradigm, using granule features to represent the text problem domain.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
<!-- PUBLICATION START -->
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question52">
<h4 class="panel-title research-paper">
<strong>Topic-Specific Optimization and Structuring</strong> by <em>DA Evans, JG Shanahan, X Tong, N Roma, E Stoica, V Sheftel</em>, TREC 2001 <a href="http://trec.nist.gov/pubs/trec10/papers/CLARIT_TREC-2001_Filtering_Final.pdf" class="ing research-paper">[link]</a>
</h4>
</div> <!-- /.panel-heading -->
<div id="question52" class="panel-collapse collapse" style="height: 0px;">
<div class="panel-body">
<p>The Clairvoyance team participated in the Filtering Track, submitting the maximum number of runs in each of the filtering categories: Adaptive, Batch, and Routing. We had two distinct goals this year: (1) to establish the generalizability of our approach to adaptive filtering and (2) to experiment with relatively more "radical" approaches to batch filtering using ensembles of filters. Our routing runs served principally to establish an internal basis for comparisons in performance to adaptive and batch efforts and are not discussed in this report.</p>
</div> <!-- /.panel-body -->
</div> <!-- /#question1 -->
</div><!--/panel-->
</div><!--/panel-group-->
</div> <!-- /.col-md-8 -->
<div class="col-md-8 col-md-offset-2 research-publications">
<header>
<h2 id="patents">Patents</h2>
</header>
<div class="panel-group" id="faqAccordion">
<div class="panel panel-default ">
<div class="panel-heading accordion-toggle question-toggle collapsed" data-toggle="collapse" data-parent="#faqAccordion" data-target="#question53">
<h4 class="panel-title research-paper">