/*
htmLawed_README.txt, 9 June 2015
htmLawed 1.1.20, 9 June 2015
Copyright Santosh Patnaik
Dual licensed with LGPL 3 and GPL 2+
A PHP Labware internal utility - http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed
*/
== Content ==========================================================
1 About htmLawed
1.1 Example uses
1.2 Features
1.3 History
1.4 License & copyright
1.5 Terms used here
1.6 Availability
2 Usage
2.1 Simple
2.2 Configuring htmLawed using the '$config' parameter
2.3 Extra HTML specifications using the '$spec' parameter
2.4 Performance time & memory usage
2.5 Some security risks to keep in mind
2.6 Use without modifying old 'kses()' code
2.7 Tolerance for ill-written HTML
2.8 Limitations & work-arounds
2.9 Examples of usage
3 Details
3.1 Invalid/dangerous characters
3.2 Character references/entities
3.3 HTML elements
3.3.1 HTML comments and 'CDATA' sections
3.3.2 Tag-transformation for better XHTML-Strict
3.3.3 Tag balancing and proper nesting
3.3.4 Elements requiring child elements
3.3.5 Beautify or compact HTML
3.4 Attributes
3.4.1 Auto-addition of XHTML-required attributes
3.4.2 Duplicate/invalid 'id' values
3.4.3 URL schemes (protocols) and scripts in attribute values
3.4.4 Absolute & relative URLs
3.4.5 Lower-cased, standard attribute values
3.4.6 Transformation of deprecated attributes
3.4.7 Anti-spam & 'href'
3.4.8 Inline style properties
3.4.9 Hook function for tag content
3.5 Simple configuration directive for most valid XHTML
3.6 Simple configuration directive for most `safe` HTML
3.7 Using a hook function
3.8 Obtaining `finalized` parameter values
3.9 Retaining non-HTML tags in input with mixed markup
4 Other
4.1 Support
4.2 Known issues
4.3 Change-log
4.4 Testing
4.5 Upgrade, & old versions
4.6 Comparison with 'HTMLPurifier'
4.7 Use through application plug-ins/modules
4.8 Use in non-PHP applications
4.9 Donate
4.10 Acknowledgements
5 Appendices
5.1 Characters discouraged in HTML
5.2 Valid attribute-element combinations
5.3 CSS 2.1 properties accepting URLs
5.4 Microsoft Windows 1252 character replacements
5.5 URL format
5.6 Brief on htmLawed code
== 1 About htmLawed ================================================
htmLawed is a PHP script to process text with HTML markup to make it more compliant with HTML standards and administrative policies. It works by making HTML well-formed with balanced and properly nested tags, neutralizing code that may be used for cross-site scripting (XSS) attacks, allowing only specified HTML tags and attributes, and so on. Such `lawing in` of HTML in text used in (X)HTML or XML documents ensures that it is in accordance with the aesthetics, safety and usability requirements set by administrators.
htmLawed is highly customizable, and fast with low memory usage. Its free and open-source code is in one small file, does not require extensions or libraries, and works in older versions of PHP as well. It is a good alternative to the HTML Tidy:- http://tidy.sourceforge.net application.
-- 1.1 Example uses ------------------------------------------------
* Filtering of text submitted as comments on blogs to allow only certain HTML elements
* Making RSS/Atom newsfeed item-content standard-compliant: often one uses an excerpt from an HTML document for the content, and with unbalanced tags, non-numerical entities, etc., such excerpts may not be XML-compliant
* Text processing for stricter XML standard-compliance: e.g., a lowercased 'x' in hexadecimal numeric entities becomes necessary if an XHTML document with MathML content needs to be served as 'application/xml'
* Scraping text or data from web-pages
* Pretty-printing HTML code
-- 1.2 Features ---------------------------------------------------o
Key: '*' security feature, '^' standard compliance, '~' requires setting right options, '`' different from 'Kses'
* make input more *secure* and *standard-compliant*
* use for HTML 4, XHTML 1.0 or 1.1, or even generic *XML* documents ^~`
* *beautify* or *compact* HTML ^~`
* can *restrict elements* ^~`
* ensures proper closure of empty elements like 'img' ^`
* *transform deprecated elements* like 'u' ^~`
* HTML *comments* and 'CDATA' sections can be permitted ^~`
* elements like 'script', 'object' and 'form' can be permitted ~
* *restrict attributes*, including *element-specifically* ^~`
* remove *invalid attributes* ^`
* element and attribute names are *lower-cased* ^
* provide *required attributes*, like 'alt' for 'img' ^`
* *transforms deprecated attributes* ^~`
* attributes *declared only once* ^`
* *restrict attribute values*, including *element-specifically* ^~`
* a value is declared for `empty` (`minimized`) attributes like 'checked' ^
* check for potentially dangerous attribute values *~
* ensure *unique* 'id' attribute values ^~`
* *double-quote* attribute values ^
* lower-case *standard attribute values* like 'password' ^`
* permit custom, non-standard attributes as well as custom rules for standard attributes ~`
* *attribute-specific URL protocol/scheme restriction* *~`
* disable *dynamic expressions* in 'style' values *~`
* neutralize invalid named character entities ^`
* *convert* hexadecimal numeric entities to decimal ones, or vice versa ^~`
* convert named entities to numeric ones for generic XML use ^~`
* remove *null* characters *
* neutralize potentially dangerous proprietary Netscape *Javascript entities* *
* replace potentially dangerous *soft-hyphen* character in URL-accepting attribute values with spaces *
* remove common *invalid characters* not allowed in HTML or XML ^`
* replace *characters from Microsoft applications* like 'Word' that are discouraged in HTML or XML ^~`
* neutralize entities for characters invalid or discouraged in HTML or XML ^`
* appropriately neutralize '<', '&', '"', and '>' characters ^*`
* understands improperly spaced tag content (like, spread over more than a line) and properly spaces them `
* attempts to *balance tags* for well-formedness ^~`
* understands when *omittable closing tags* like '</p>' (allowed in HTML 4, transitional, e.g.) are missing ^~`
* attempts to permit only *validly nested tags* ^~`
* option to *remove or neutralize bad content* ^~`
* attempts to *rectify common errors of plain-text misplacement* (e.g., directly inside 'blockquote') ^~`
* fast, *non-OOP* code of ~45 kb incurring peak basal memory usage of ~0.5 MB
* *compatible* with pre-existing code using 'Kses' (the filter used by 'WordPress')
* optional *anti-spam* measures such as addition of 'rel="nofollow"' and link-disabling ~`
* optionally makes *relative URLs absolute*, and vice versa ~`
* optionally mark '&' to identify the entities for '&', '<' and '>' introduced by htmLawed ~`
* allows deployment of powerful *hook functions* to *inject* HTML, *consolidate* 'style' attributes to 'class', finely check attribute values, etc. ~`
* *independent of character encoding* of input and does not affect it
* *tolerance for ill-written HTML* to a certain degree
-- 1.3 History ----------------------------------------------------o
htmLawed was created in 2007 for use with 'LabWiki', a wiki software developed at PHP Labware, because no suitable software could be found. Existing PHP software like 'Kses' and 'HTMLPurifier' was deemed inadequate, slow, resource-intensive, or dependent on an extension or external application like 'HTML Tidy'. The core logic of htmLawed, that of identifying HTML elements and attributes, was based on the 'Kses' (version 0.2.2) HTML filter software of Ulf Harnhammar (it can still be used with code that uses 'Kses'; see section:- #2.6).
See section:- #4.3 for a detailed log of changes in htmLawed over the years, and section:- #4.10 for acknowledgements.
-- 1.4 License & copyright ----------------------------------------o
htmLawed is free and open-source software copyrighted by Santosh Patnaik, MD, PhD, and dual-licensed under LGPL license version 3:- http://www.gnu.org/licenses/lgpl-3.0.txt, and GPL license version 2:- http://www.gnu.org/licenses/gpl-2.0.txt (or later).
-- 1.5 Terms used here --------------------------------------------o
In this document, only HTML body-level elements are considered. htmLawed does not have support for head-level elements, 'body', and the frame-level elements, 'frameset', 'frame' and 'noframes', and these elements are ignored here.
* `administrator` - or admin; person setting up the code that utilizes htmLawed; also, `user`
* `attributes` - name-value pairs like 'href="http://x.com"' in opening tags
* `author` - see `writer`
* `character` - atomic unit of text; internally represented by a numeric `code-point` as specified by the `encoding` or `charset` in use
* `entity` - markup like '&gt;' and '&nbsp;' used to refer to a character
* `element` - HTML element like 'a' and 'img'
* `element content` - content between the opening and closing tags of an element, like 'click' of the '<a href="x">click</a>' element
* `HTML` - implies XHTML unless specified otherwise
* `HTML body` - complete HTML documents typically have a `head` and a `body` container. Information in `head` specifies the title of the document, etc., whereas that in `body` specifies what is to be displayed on a web-page; htmLawed is concerned only with the elements for `body`, except 'frame', 'frameset' and 'noframes'
* `input` - text given to htmLawed to process
* `processing` - involves filtering, correction, etc., of input
* `safe` - absence or reduction of certain characters and HTML elements and attributes in HTML of text that can otherwise potentially, and circumstantially, expose text readers to security vulnerabilities like cross-site scripting attacks (XSS)
* `scheme` - a URL protocol like 'http' and 'ftp'
* `specifications` - standard specifications, for HTML4, HTML5, Ruby, etc.
* `style property` - terms like 'border' and 'height' for which declarations are made in values for the 'style' attribute of elements
* `tag` - markers like '<a href="x">' and '</a>' delineating element content; the opening tag can contain attributes
* `tag content` - consists of tag markers '<' and '>', element names like 'div', and possibly attributes
* `user` - administrator
* `writer` - end-user like a blog commenter providing the input that is to be processed; also, `author`
-- 1.6 Availability ------------------------------------------------o
htmLawed can be downloaded for free at its website:- http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed. Besides the 'htmLawed.php' file, the download has the htmLawed documentation (this document) in plain text:- htmLawed_README.txt and HTML:- htmLawed_README.htm formats, a script for testing:- htmLawedTest.php, and a text file for test-cases:- htmLawed_TESTCASE.txt. htmLawed is also available as a PHP class (OOP code) on its website.
== 2 Usage ========================================================oo
htmLawed works in PHP version 4.4 or higher. Either 'include()' the 'htmLawed.php' file, or copy-paste the entire code. To use with PHP 4.3, have the following code included:
if(!function_exists('ctype_digit')){
  function ctype_digit($var){
    return ((int) $var == $var);
  }
}
-- 2.1 Simple ------------------------------------------------------
The input text to be processed, '$text', is passed as an argument of type string; 'htmLawed()' returns the processed string:
$processed = htmLawed($text);
With the 'htmLawed class' (section:- #1.6), usage is:
$processed = htmLawed::hl($text);
*Notes*: (1) If input is from a '$_GET' or '$_POST' value, and 'magic quotes' are enabled on the PHP setup, run 'stripslashes()' on the input before passing to htmLawed. (2) htmLawed does not have support for head-level elements, 'body', and the frame-level elements, 'frameset', 'frame' and 'noframes'.
By default, htmLawed will process the text allowing all valid HTML elements/tags, secure URL scheme/CSS style properties, etc. It will allow 'CDATA' sections and HTML comments, balance tags, and ensure proper nesting of elements. Such actions can be configured using two other optional arguments -- '$config' and '$spec':
$processed = htmLawed($text, $config, $spec);
The '$config' and '$spec' arguments are detailed below. Some examples are shown in section:- #2.9. For maximum protection against 'XSS' and other scripting attacks (e.g., by disallowing Javascript code), consider using the 'safe' parameter; see section:- #3.6.
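The calls described above can be sketched as follows. This is a minimal illustration, not part of the htmLawed distribution: it assumes 'htmLawed.php' sits in the same directory as the calling script, the sample input is made up, and the fall-back to PHP's 'htmlspecialchars()' when the library is not loaded is a design choice of this sketch only.

```php
<?php
// Assumed location of the library file; adjust to the actual setup.
$lib = __DIR__ . '/htmLawed.php';
if (is_file($lib)) {
    include_once $lib;
}

// Illustrative input with an unclosed 'b' element.
$text = '<p>Hello <b>world</p>';

// Use htmLawed with its default settings when it is available.
// Otherwise escape everything, so that no raw markup gets through
// (a choice made for this sketch, not htmLawed behaviour).
if (function_exists('htmLawed')) {
    $processed = htmLawed($text);
} else {
    $processed = htmlspecialchars($text, ENT_QUOTES);
}

echo $processed, "\n";
```

Escaping the whole input when the filter is missing errs on the side of safety: unfiltered markup is never echoed back to a browser.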
-- 2.2 Configuring htmLawed using the '$config' parameter ---------o
'$config' instructs htmLawed on how to tackle certain tasks. When '$config' is not specified, or not set as an array (e.g., '$config = 1'), htmLawed will take default actions. One or many of the task-action or value-specification pairs can be specified in '$config' as array key-value pairs. If a parameter is not specified, htmLawed will use the default value/action indicated further below.
$config = array('comment'=>0, 'cdata'=>1);
$processed = htmLawed($text, $config);
Or,
$processed = htmLawed($text, array('comment'=>0, 'cdata'=>1));
Below are the possible value-specification combinations. In PHP code, values that are integers should not be quoted and should be used as numeric types (unless meant as string/text).
Key: '*' default, '^' different default when htmLawed is used in the Kses-compatible mode (see section:- #2.6), '~' different default when 'valid_xhtml' is set to '1' (see section:- #3.5), '"' different default when 'safe' is set to '1' (see section:- #3.6)
*abs_url*
Make URLs absolute or relative; '$config["base_url"]' needs to be set; see section:- #3.4.4
'-1' - make relative
'0' - no action *
'1' - make absolute
*and_mark*
Mark '&' characters in the original input; see section:- #3.2
*anti_link_spam*
Anti-link-spam measure; see section:- #3.4.7
'0' - no measure taken *
`array("regex1", "regex2")` - will ensure a 'rel' attribute with 'nofollow' in its value in case the 'href' attribute value matches the regular expression pattern 'regex1', and/or will remove 'href' if its value matches the regular expression pattern 'regex2'. E.g., 'array("/./", "/://\W*(?!(abc\.com|xyz\.org))/")'; see section:- #3.4.7 for more.
*anti_mail_spam*
Anti-mail-spam measure; see section:- #3.4.7
'0' - no measure taken *
`word` - '@' in mail address in 'href' attribute value is replaced with specified `word`
*balance*
Balance tags for well-formedness and proper nesting; see section:- #3.3.3
'0' - no
'1' - yes *
*base_url*
Base URL value that needs to be set if '$config["abs_url"]' is not '0'; see section:- #3.4.4
*cdata*
Handling of 'CDATA' sections; see section:- #3.3.1
'0' - don't consider 'CDATA' sections as markup and proceed as if plain text ^"
'1' - remove
'2' - allow, but neutralize any '<', '>', and '&' inside by converting them to named entities
'3' - allow *
*clean_ms_char*
Replace discouraged characters introduced by Microsoft Word, etc.; see section:- #3.1
'0' - no *
'1' - yes
'2' - yes, but replace special single & double quotes with ordinary ones
*comment*
Handling of HTML comments; see section:- #3.3.1
'0' - don't consider comments as markup and proceed as if plain text ^"
'1' - remove
'2' - allow, but neutralize any '<', '>', and '&' inside by converting to named entities
'3' - allow *
*css_expression*
Allow dynamic CSS expression by not removing the expression from CSS property values in 'style' attributes; see section:- #3.4.8
'0' - remove *
'1' - allow
*deny_attribute*
Denied HTML attributes; see section:- #3.4
'0' - none *
`string` - dictated by values in `string`
'on*' (like 'onfocus') attributes not allowed - "
*direct_nest_list*
Allow direct nesting of a list within another without requiring it to be a list item; see section:- #3.3.4
'0' - no *
'1' - yes
*elements*
Allowed HTML elements; see section:- #3.3
'* -center -dir -font -isindex -menu -s -strike -u' - ~
'applet, embed, iframe, object, script' not allowed - "
*hexdec_entity*
Allow hexadecimal numeric entities and do not convert to the more widely accepted decimal ones, or convert decimal to hexadecimal ones; see section:- #3.2
'0' - no
'1' - yes *
'2' - convert decimal to hexadecimal ones
*hook*
Name of an optional hook function to alter the input string, '$config' or '$spec' before htmLawed starts its main work; see section:- #3.7
'0' - no hook function *
`name` - `name` is name of the hook function ('kses_hook' ^)
*hook_tag*
Name of an optional hook function to alter tag content finalized by htmLawed; see section:- #3.4.9
'0' - no hook function *
`name` - `name` is name of the hook function
*keep_bad*
Neutralize bad tags by converting '<' and '>' to entities, or remove them; see section:- #3.3.3
'0' - remove ^
'1' - neutralize both tags and element content
'2' - remove tags but neutralize element content
'3' and '4' - like '1' and '2' but remove if text ('pcdata') is invalid in parent element
'5' and '6' * - like '3' and '4' but line-breaks, tabs and spaces are left
*lc_std_val*
For XHTML compliance, predefined, standard attribute values, like 'get' for the 'method' attribute of 'form', must be lowercased; see section:- #3.4.5
'0' - no
'1' - yes *
*make_tag_strict*
Transform/remove these non-strict XHTML elements, even if they are allowed by the admin: 'applet' 'center' 'dir' 'embed' 'font' 'isindex' 'menu' 's' 'strike' 'u'; see section:- #3.3.2
'0' - no ^
'1' - yes, but leave 'applet', 'embed' and 'isindex' elements that currently can't be transformed *
'2' - yes, removing 'applet', 'embed' and 'isindex' elements and their contents (nested elements remain) ~
*named_entity*
Allow non-universal named HTML entities, or convert to numeric ones; see section:- #3.2
'0' - convert
'1' - allow *
*no_deprecated_attr*
Allow deprecated attributes or transform them; see section:- #3.4.6
'0' - allow ^
'1' - transform, but 'name' attributes for 'a' and 'map' are retained *
'2' - transform
*parent*
Name of the parent element, possibly imagined, that will hold the input; see section:- #3.3
*safe*
Magic parameter to make input the most secure against XSS without needing to specify other relevant '$config' parameters; see section:- #3.6
'0' - no *
'1' - will auto-adjust other relevant '$config' parameters (indicated by '"' in this list)
*schemes*
Array of attribute-specific, comma-separated, lower-cased list of schemes (protocols) allowed in attributes accepting URLs (or '!' to `deny` any URL); '*' covers all unspecified attributes; see section:- #3.4.3
'href: aim, feed, file, ftp, gopher, http, https, irc, mailto, news, nntp, sftp, ssh, telnet; *:file, http, https' *
'*: ftp, gopher, http, https, mailto, news, nntp, telnet' ^
'href: aim, feed, file, ftp, gopher, http, https, irc, mailto, news, nntp, sftp, ssh, telnet; style: !; *:file, http, https' "
*show_setting*
Name of a PHP variable to assign the `finalized` '$config' and '$spec' values; see section:- #3.8
*style_pass*
Do not look at 'style' attribute values, letting them through without any alteration
'0' - no *
'1' - htmLawed will let through any 'style' value; see section:- #3.4.8
*tidy*
Beautify or compact HTML code; see section:- #3.3.5
'-1' - compact
'0' - no *
'1' or 'string' - beautify (custom format specified by 'string')
*unique_ids*
'id' attribute value checks; see section:- #3.4.2
'0' - no ^
'1' - remove duplicate and/or invalid ones *
`word` - remove invalid ones and replace duplicate ones with new and unique ones based on the `word`; the admin-specified `word`, like 'my_', should begin with a letter (a-z) and can contain letters, digits, '.', '_', '-', and ':'.
*valid_xhtml*
Magic parameter to make input the most valid XHTML without needing to specify other relevant '$config' parameters; see section:- #3.5
'0' - no *
'1' - will auto-adjust other relevant '$config' parameters (indicated by '~' in this list)
*xml:lang*
Auto-adding 'xml:lang' attribute; see section:- #3.4.1
'0' - no *
'1' - add if 'lang' attribute is present
'2' - add if 'lang' attribute is present, and remove 'lang' ~
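As an illustration of combining several of the parameters listed above, the following sketch builds one '$config' array. The particular values, the 'my_' prefix and the base URL 'http://example.com/docs/' are arbitrary choices made for this example and not recommendations; the location of 'htmLawed.php' is likewise an assumption.

```php
<?php
// Assumes htmLawed.php sits next to this script; skipped if absent.
if (is_file(__DIR__ . '/htmLawed.php')) {
    include_once __DIR__ . '/htmLawed.php';
}

// Integer values are given as unquoted numbers, string values as strings.
$config = array(
    'elements'   => '* -script',  // all elements except 'script'
    'comment'    => 1,            // remove HTML comments
    'cdata'      => 1,            // remove CDATA sections
    'tidy'       => -1,           // compact the HTML code
    'unique_ids' => 'my_',        // base word for replacing duplicate 'id' values
    'abs_url'    => 1,            // make URLs absolute; this needs 'base_url'
    'base_url'   => 'http://example.com/docs/', // hypothetical base URL
);

$text = '<div id="a"><!-- note --><a href="page.htm">link</a></div>';

if (function_exists('htmLawed')) {
    echo htmLawed($text, $config), "\n";
}
```

Parameters left out of the array keep the defaults marked with '*' in the list.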
-- 2.3 Extra HTML specifications using the $spec parameter --------o
The '$spec' argument of htmLawed can be used to disallow an otherwise legal attribute for an element, or to restrict the attribute's values. This can also be helpful as a security measure (e.g., in certain versions of browsers, certain values can cause buffer overflows and denial of service attacks), or in enforcing admin policies. '$spec' is specified as a string of text containing one or more `rules`, with multiple rules separated from each other by a semi-colon (';'). E.g.,
$spec = 'i=-*; td, tr=style, id, -*; a=id(match="/[a-z][a-z\d.:\-`"]*/i"/minval=2), href(maxlen=100/minlen=34); img=-width,-alt';
$processed = htmLawed($text, $config, $spec);
Or,
$processed = htmLawed($text, $config, 'i=-*; td, tr=style, id, -*; a=id(match="/[a-z][a-z\d.:\-`"]*/i"/minval=2), href(maxlen=100/minlen=34); img=-width,-alt');
A rule begins with an HTML *element* name(s) (`rule-element`), for which the rule applies, followed by an equal ('=') sign. A rule-element may represent multiple elements if comma (,)-separated element names are used. E.g., 'th,td,tr='.
Rest of the rule consists of comma-separated HTML *attribute names*. A minus ('-') character before an attribute means that the attribute is not permitted inside the rule-element. E.g., '-width'. To deny all attributes, '-*' can be used.
The following are examples of rule excerpts with rule-element 'a', along with the attributes that each permits:
* 'a=' - all
* 'a=id' - all
* 'a=href, title, -id, -onclick' - all except 'id' and 'onclick'
* 'a=*, id, -id' - all except 'id'
* 'a=-*' - none
* 'a=-*, href, title' - none except 'href' and 'title'
* 'a=-*, -id, href, title' - none except 'href' and 'title'
Rules regarding *attribute values* are optionally specified inside round brackets after attribute names in slash ('/')-separated `parameter = value` pairs. E.g., 'title(maxlen=30/minlen=5)'. None or one or more of the following parameters may be specified:
* 'oneof' - one or more choices separated by '|' that the value should match; if only one choice is provided, then the value must match that choice
* 'noneof' - one or more choices separated by '|' that the value should not match
* 'maxlen' and 'minlen' - upper and lower limits for the number of characters in the attribute value; specified in numbers
* 'maxval' and 'minval' - upper and lower limits for the numerical value specified in the attribute value; specified in numbers
* 'match' and 'nomatch' - pattern that the attribute value should or should not match; specified as PHP/PCRE-compatible regular expressions with delimiters and possibly modifiers
* 'default' - a value to force on the attribute if the value provided by the writer does not fit any of the specified parameters
If 'default' is not set and the attribute value does not satisfy any of the specified parameters, then the attribute is removed. The 'default' value can also be used to force all attribute declarations to take the same value (by getting the values declared illegal by setting, e.g., 'maxlen' to '-1').
Examples with `input` '<input title="WIDTH" value="10em" /><input title="length" value="5" />' are shown below.
`Rule`: 'input=title(maxlen=60/minlen=6), value'
`Output`: '<input value="10em" /><input title="length" value="5" />'
`Rule`: 'input=title(), value(maxval=8/default=6)'
`Output`: '<input title="WIDTH" value="6" /><input title="length" value="5" />'
`Rule`: 'input=title(nomatch=%w.d%i), value(match=%em%/default=6em)'
`Output`: '<input value="10em" /><input title="length" value="6em" />'
`Rule`: 'input=title(oneof=height|depth/default=depth), value(noneof=5|6)'
`Output`: '<input title="depth" value="10em" /><input title="depth" />'
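The way the value parameters interact in the examples above can be mimicked with a small stand-alone function. This is only a sketch of the documented behaviour and not htmLawed's own code: the function name 'apply_rule()' and the array form of the rule are hypothetical, and comparing numerical limits after a '(float)' cast is an assumption chosen to agree with the outputs shown above.

```php
<?php
// Hypothetical helper imitating the documented rule semantics.
// Returns the value to use, or null if the attribute would be removed.
function apply_rule($value, array $rule)
{
    $ok = true;
    $len = strlen($value);
    if (isset($rule['maxlen']) && $len > $rule['maxlen']) { $ok = false; }
    if (isset($rule['minlen']) && $len < $rule['minlen']) { $ok = false; }
    // Assumption: numeric limits compare the leading number in the value.
    if (isset($rule['maxval']) && (float) $value > $rule['maxval']) { $ok = false; }
    if (isset($rule['minval']) && (float) $value < $rule['minval']) { $ok = false; }
    if (isset($rule['oneof']) && !in_array($value, explode('|', $rule['oneof']), true)) { $ok = false; }
    if (isset($rule['noneof']) && in_array($value, explode('|', $rule['noneof']), true)) { $ok = false; }
    if (isset($rule['match']) && !preg_match($rule['match'], $value)) { $ok = false; }
    if (isset($rule['nomatch']) && preg_match($rule['nomatch'], $value)) { $ok = false; }
    if ($ok) {
        return $value;
    }
    // A failed check falls back to 'default' if set; otherwise the attribute goes.
    return isset($rule['default']) ? $rule['default'] : null;
}

// title(maxlen=60/minlen=6)
var_dump(apply_rule('WIDTH', array('maxlen' => 60, 'minlen' => 6)));   // NULL
var_dump(apply_rule('length', array('maxlen' => 60, 'minlen' => 6)));  // string "length"
// value(maxval=8/default=6)
var_dump(apply_rule('10em', array('maxval' => 8, 'default' => '6')));  // string "6"
// value(noneof=5|6)
var_dump(apply_rule('5', array('noneof' => '5|6')));                   // NULL
```

Every specified parameter must be satisfied for the value to be kept.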
*Special characters*: The characters ';', ',', '/', '(', ')', '|', '~' and space have special meanings in the rules. Words in the rules that use such characters, or the characters themselves, should be `escaped` by enclosing in pairs of double-quotes ('"'). A back-tick ('`') can be used to escape a literal '"'. An example rule illustrating this is 'input=value(maxlen=30/match="/^\w/"/default="your `"ID`"")'.
*Note*: To deny an attribute for all elements for which it is legal, '$config["deny_attribute"]' (see section:- #3.4) can be used instead of '$spec'. Also, attributes can be allowed element-specifically through '$spec' while being denied globally through '$config["deny_attribute"]'. The 'hook_tag' parameter (section:- #3.4.9) can also be used to implement functionality similar to that provided by '$spec'.
'$spec' can also be used to permit custom, non-standard attributes as well as custom rules for standard attributes. Thus, the following value of '$spec' will permit the custom uses of the standard 'rel' attribute in 'input' (not permitted as per standards) and of a non-standard attribute, 'vFlag', in 'img'.
$spec = 'img=vFlag; input=rel';
The attribute names can contain letters, colons (:) and hyphens (-), but they must start with a letter.
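When a '$spec' value has many rules, it may be easier to keep the rules as separate strings and join them with semi-colons. The sketch below does this with rules taken from this section; holding the rules in an array is a convenience of this example and not something htmLawed requires, and the location of 'htmLawed.php' and the sample input are assumptions.

```php
<?php
// Assumes htmLawed.php sits next to this script; skipped if absent.
if (is_file(__DIR__ . '/htmLawed.php')) {
    include_once __DIR__ . '/htmLawed.php';
}

// One rule per entry; each starts with the rule-element(s) and '='.
$rules = array(
    'a=-*, href, title',                       // 'a': only 'href' and 'title'
    'input=title(maxlen=60/minlen=6), value',  // 'title' of 6 to 60 characters
    'img=vFlag',                               // custom attribute for 'img'
);

// Rules are separated from each other by semi-colons.
$spec = implode('; ', $rules);

$text = '<a href="x.htm" onclick="go()">x</a> <input title="WIDTH" value="1" />';

if (function_exists('htmLawed')) {
    // No '$config' array is given here, so default actions are taken.
    echo htmLawed($text, 1, $spec), "\n";
}
```

Passing a non-array value such as '1' for '$config' makes htmLawed use its default actions, as noted in section:- #2.2.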
-- 2.4 Performance time & memory usage ----------------------------o
The time and memory consumed during text processing by htmLawed depend on its configuration, the size of the input, and the amount, nestedness and well-formedness of the HTML markup within the input. In particular, tag balancing and beautification can each increase the processing time by about a quarter.
The htmLawed demo:- htmLawedTest.php can be used to evaluate the performance and effects of different types of input and '$config'.
-- 2.5 Some security risks to keep in mind ------------------------o
When setting the parameters/arguments (like those to allow certain HTML elements) for use with htmLawed, one should bear in mind that the setting may let through potentially `dangerous` HTML code which is meant to steal user-data, deface a website, render a page non-functional, etc. Unless end-users, either people or software, supplying the content are completely trusted, security issues arising from the degree of HTML usage permitted through htmLawed's setting should be considered. For example, the following increase security risks:
* Allowing 'script', 'applet', 'embed', 'iframe' or 'object' elements, or certain of their attributes like 'allowscriptaccess'
* Allowing HTML comments (some Internet Explorer versions are vulnerable with, e.g., '<!--[if gte IE 4]><script>alert("xss");</script><![endif]-->')
* Allowing dynamic CSS expressions (some Internet Explorer versions are vulnerable)
* Allowing the 'style' attribute
To remove `unsecure` HTML, code-developers using htmLawed must set '$config' appropriately. E.g., '$config["elements"] = "* -script"' to deny the 'script' element (section:- #3.3), '$config["safe"] = 1' to auto-configure certain htmLawed parameters for maximizing security (section:- #3.6), etc.
Permitting the '*style*' attribute brings in risks of `click-jacking`, `phishing`, web-page overlays, etc., `even` when the 'safe' parameter is enabled (see section:- #3.6). Except for URLs and a few other things like CSS dynamic expressions, htmLawed currently does not check every CSS style property. It does provide ways for the code-developer implementing htmLawed to do such checks through htmLawed's '$spec' argument, and through the 'hook_tag' parameter (see section:- #3.4.8 for more). Disallowing 'style' completely and relying on CSS classes and stylesheet files is recommended.
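A configuration along these lines, for input from untrusted writers, might look like the sketch below. It assumes that 'deny_attribute' takes the attribute name as its string value (section:- #3.4 has the details), and the location of 'htmLawed.php' and the sample input are made up for the illustration.

```php
<?php
// Assumes htmLawed.php sits next to this script; skipped if absent.
if (is_file(__DIR__ . '/htmLawed.php')) {
    include_once __DIR__ . '/htmLawed.php';
}

$config = array(
    'safe'           => 1,        // auto-adjust the security-related parameters
    'deny_attribute' => 'style',  // additionally refuse 'style' everywhere
);

// Illustrative input from an untrusted writer.
$text = '<p style="position:fixed; top:0">Hi <script>alert(1)</script></p>';

if (function_exists('htmLawed')) {
    echo htmLawed($text, $config), "\n";
}
```

Setting 'deny_attribute' explicitly replaces the value that 'safe' would otherwise choose for it, so the result should be checked with the htmLawed demo before relying on it.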
htmLawed does not check or correct the character *encoding* of the input it receives. In conjunction with permissive circumstances, such as when the character encoding is left undefined through HTTP headers or HTML 'meta' tags, this can allow for an exploit (like Google's `UTF-7/XSS` vulnerability of the past).
-- 2.6 Use without modifying old 'kses()' code --------------------o
The 'Kses' PHP script is used by many applications (like 'WordPress'). It is possible to have such applications use htmLawed instead, since it is compatible with code that calls the 'kses()' function declared in the 'Kses' file (usually named 'kses.php'). E.g., application code like this will continue to work after replacing 'Kses' with htmLawed:
$comment_filtered = kses($comment_input, array('a'=>array(), 'b'=>array(), 'i'=>array()));
For some of the '$config' parameters, htmLawed will use values other than the default ones. These are indicated by '^' in section:- #2.2. To force htmLawed to use other values, function 'kses()' in the htmLawed code should be edited -- a few configurable parameters/variables need to be changed.
If the application uses a 'Kses' file that has the 'kses()' function declared, then, to have the application use htmLawed instead of 'Kses', simply rename 'htmLawed.php' (to 'kses.php', e.g.) and replace the 'Kses' file (or just replace the code in the 'Kses' file with the htmLawed code). If the 'kses()' function in the 'Kses' file had been renamed by the application developer (e.g., in 'WordPress', it is named 'wp_kses()'), then appropriately rename the 'kses()' function in the htmLawed code.
If the 'Kses' file used by the application has been highly altered by the application developers, then one may need a different approach. E.g., with 'WordPress', it is best to copy the htmLawed code to 'wp-includes/kses.php', rename the newly added function 'kses()' to 'wp_kses()', and delete the code for the original 'wp_kses()' function.
If the 'Kses' code has a non-empty hook function (e.g., 'wp_kses_hook()' in case of 'WordPress'), then the code for htmLawed's 'kses_hook()' function should be appropriately edited. However, the requirement of the hook function should be re-evaluated considering that htmLawed has extra capabilities. With 'WordPress', the hook function is an essential one. The following code is suggested for the htmLawed 'kses_hook()' in case of 'WordPress':
function kses_hook($string, &$cf, &$spec){
  // kses compatibility
  $allowed_html = $spec;
  // Collect the unique scheme names from htmLawed's 'schemes' setting
  $allowed_protocols = array();
  foreach($cf['schemes'] as $v){
    foreach($v as $k2=>$v2){
      if(!in_array($k2, $allowed_protocols)){
        $allowed_protocols[] = $k2;
      }
    }
  }
  return wp_kses_hook($string, $allowed_html, $allowed_protocols);
  // eof
}
-- 2.7 Tolerance for ill-written HTML -----------------------------o
htmLawed can work with ill-written HTML code in the input. However, HTML that is too ill-written may not be `read` as HTML, and may therefore get identified as mere plain text. The following statements indicate the degree of `looseness` that htmLawed can work with, and can be provided in instructions to writers:
* Tags must be flanked by '<' and '>' with no '>' inside -- any needed '>' should be put in as '&gt;'. It is possible for tag content (element name and attributes) to be spread over many lines instead of being on one. A space may be present between the tag content and '>', like '<div >' and '<img / >', but not after the '<'.
* Element and attribute names need not be lower-cased.
* Attribute string of elements may be liberally spaced with tabs, line-breaks, etc.
* Attribute values may be single- and not double-quoted.
* Left-padding of numeric entities (like, '&#0160;', '&#x07ff;') with '0' is okay as long as the number of characters between the '&' and the ';' does not exceed 8. All entities must end with ';' though.
* Named character entities must be properly cased. Thus, '&Lt;' or '&TILDE;' will not be recognized as entities and will be `neutralized`.
* HTML comments should not be inside element tags (they can be between tags), and should begin with '<!--' and end with '-->'. Characters like '<', '>', and '&' may be allowed inside depending on '$config', but any '-->' inside should be put in as '--&gt;'. Any '--' inside will be automatically converted to '-', and a space will be added before the comment delimiter '-->'.
* 'CDATA' sections should not be inside element tags, and can be in element content only if plain text is allowed for that element. They should begin with '<![CDATA[' and end with ']]>'. Characters like '<', '>', and '&' may be allowed inside depending on '$config', but any ']]>' inside should be put in as ']]&gt;'.
* For attribute values, character entities '&lt;', '&gt;' and '&amp;' should be used instead of the characters '<', '>' and '&' (when '&' is not part of a character entity). This applies even for Javascript code in values of attributes like 'onclick'.
* Characters '<', '>', '&' and '"' that are part of actual Javascript, etc., code in 'script' elements should be used as such and not be put in as entities like '&gt;'. Otherwise, though the HTML will be valid, the code may fail to work. Further, if such characters have to be used, then they should be put inside 'CDATA' sections.
* Simple instructions like "an opening tag cannot be present between two closing tags" and "nested elements should be closed in the reverse order of how they were opened" can help authors write balanced HTML. If tags are imbalanced, htmLawed will try to balance them, but in the process, depending on '$config["keep_bad"]', some code/text may be lost.
* Input authors should be notified of admin-specified allowed elements, attributes, configuration values (like conversion of named entities to numeric ones), etc.
* With '$config["unique_ids"]' not '0' and the 'id' attribute being permitted, writers should carefully avoid using duplicate or invalid 'id' values as even though htmLawed will correct/remove the values, the final output may not be the one desired. E.g., '<a id="home"></a><input id="home" /><label for="home"></label>' is processed into '<a id="home"></a><input id="prefix_home" /><label for="home"></label>', in which the 'label' points to the 'a' element rather than the 'input'.
* Even if intended HTML is lost from an ill-written input, the processed output will be more secure and standard-compliant.
* For URLs, unless '$config["schemes"]' is appropriately set, writers should avoid using escape characters or entities in schemes. E.g., 'htt&#112;' (which many browsers will read as the harmless 'http') may be considered bad by htmLawed.
* htmLawed will attempt to put plain text present directly inside 'blockquote', 'form', 'map' and 'noscript' elements (illegal as per the specifications) inside auto-generated 'div' elements.
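For writers or tools that generate markup programmatically, the rule above about entities in attribute values can be followed with PHP's 'htmlspecialchars()' function. A minimal sketch, in which the variable names and the Javascript snippet are only illustrative:

```php
// Illustrative Javascript for an 'onclick' attribute; it contains
// '<', '&' and '"', which must not appear raw in an attribute value.
$js = 'if(a < b && c){alert("hi");}';

// ENT_QUOTES also converts single quotes, so either quote style is safe.
$attr = htmlspecialchars($js, ENT_QUOTES, 'UTF-8');

// The attribute value now holds only entity-escaped special characters.
$tag = '<a href="#" onclick="'. $attr. '">x</a>';
```

Browsers decode the entities before running the attribute's script, so the Javascript itself is unchanged.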
-- 2.8 Limitations & work-arounds ---------------------------------o
htmLawed's main objective is to make the input text `more` standard-compliant, secure for readers, and free of HTML elements and attributes considered undesirable by the administrator. Some of its current limitations, regardless of this objective, are noted below along with work-arounds.
It should be borne in mind that no browser application is 100% standard-compliant, and that some of the standard specifications (like asking for normalization of white-spacing within 'textarea' elements) are clearly wrong. Regarding security, note that `unsafe` HTML code is not legally invalid per se.
* htmLawed is meant for input that goes into the 'body' of HTML documents. HTML's head-level elements are not supported, nor are the frameset elements 'frameset', 'frame' and 'noframes'. Content of the latter elements can, however, be individually filtered through htmLawed.
* It cannot transform the non-standard 'embed' elements to the standard-compliant 'object' elements. Yet, it can allow 'embed' elements if permitted ('embed' is widely used and supported). Admins can certainly use the 'hook_tag' parameter (section:- #3.4.9) to deploy a custom embed-to-object converter function.
* The only non-standard element that may be permitted is 'embed'; others like 'noembed' and 'nobr' cannot be permitted without modifying the htmLawed code.
* It cannot handle input that has non-HTML code like 'SVG' and 'MathML'. One way around is to break the input into pieces and pass only those without non-HTML code to htmLawed. Another is described in section:- #3.9. A third way may be to somehow take advantage of the '$config["and_mark"]' parameter (see section:- #3.2).
* By default, htmLawed won't check many attribute values for standard compliance. E.g., 'width="20m"' with the dimension in non-standard 'm' is let through. Implementing universal and strict attribute value checks can make htmLawed slow and resource-intensive. Admins should look at the 'hook_tag' parameter (section:- #3.4.9) or '$spec' to enforce finer checks.
* The attributes, deprecated (which can be transformed too) or not, that it supports are largely those that are in the specifications. Only a few of the proprietary attributes are supported.
* Except for contained URLs and dynamic expressions (also optional), htmLawed does not check CSS style property values. Admins should look at using the 'hook_tag' parameter (section:- #3.4.9) or '$spec' for finer checks. Perhaps the best option is to disallow 'style' but allow 'class' attributes with the right 'oneof' or 'match' values for 'class', and have the various class style properties in '.css' CSS stylesheet files.
* htmLawed does not parse emoticons, decode `BBcode`, or `wikify`, auto-converting text to proper HTML. Similarly, it won't convert line-breaks to 'br' elements. Such functions are beyond its purview. Admins should use other code to pre- or post-process the input for such purposes.
* htmLawed cannot be used to have links force-opened in new windows (by auto-adding appropriate 'target' and 'onclick' attributes to 'a'). Admins should look at Javascript-based DOM-modifying solutions for this. Admins may also be able to use a custom hook function to enforce such checks ('hook_tag' parameter; see section:- #3.4.9).
* Nesting-based checks are not possible. E.g., one cannot disallow 'p' elements specifically inside 'td' while permitting it elsewhere. Admins may be able to use a custom hook function to enforce such checks ('hook_tag' parameter; see section:- #3.4.9).
* Except for optionally converting absolute or relative URLs to the other type, htmLawed will not alter URLs (e.g., to change the value of query strings or to convert 'http' to 'https'). Having absolute URLs may be a standard-requirement, e.g., when HTML is embedded in email messages, whereas altering URLs for other purposes is beyond htmLawed's goals. Admins may be able to use a custom hook function to enforce such checks ('hook_tag' parameter; see section:- #3.4.9).
* Pairs of opening and closing tags that do not enclose any content (like '<em></em>') are not removed. This may be against the standard specifications for certain elements (e.g., 'table'). However, presence of such standard-incompliant code will not break the display or layout of content. Admins can also use simple regex-based code to filter out such code.
* htmLawed does not check for certain element orderings described in the standard specifications (e.g., in a 'table', it will let 'tbody' come before 'tfoot', which the specifications do not permit). Admins may be able to use a custom hook function to enforce such checks ('hook_tag' parameter; see section:- #3.4.9).
* htmLawed does not check the number of nested elements. E.g., it will allow two 'caption' elements in a 'table' element, illegal as per the specifications. Admins may be able to use a custom hook function to enforce such checks ('hook_tag' parameter; see section:- #3.4.9).
* htmLawed might convert certain entities to actual characters and remove backslashes and CSS comment-markers ('/*') in 'style' attribute values in order to detect malicious HTML such as crafted IE-specific dynamic expressions ('expression...'). If this is too harsh, admins can allow CSS expressions through htmLawed core but then use a custom function through the 'hook_tag' parameter (section:- #3.4.9) to more specifically identify CSS expressions in the 'style' attribute values. Also, using '$config["style_pass"]', it is possible to have htmLawed pass 'style' attribute values without even looking at them (section:- #3.4.8).
* htmLawed does not correct certain possible attribute-based security vulnerabilities (e.g., '<a href="http://x%22+style=%22background-image:xss">x</a>'). These arise when browsers mis-identify markup in `escaped` text, defeating the very purpose of escaping text (a bad browser will read the given example as '<a href="http://x" style="background-image:xss">x</a>').
* Because of poor Unicode support in PHP, htmLawed does not remove the `high value` HTML-invalid characters with multi-byte code-points. Such characters, however, are extremely unlikely to be in the input (see section:- #3.1).
* htmLawed does not check or correct the character encoding of the input it receives. In conjunction with permitting circumstances such as when the character encoding is left undefined through HTTP headers or HTML 'meta' tags, this can permit an exploit (like Google's `UTF-7/XSS` vulnerability of the past). Also, htmLawed can mangle input text if it is not well-formed in terms of character encoding. Administrators can consider using code available elsewhere to check well-formedness of input text characters to correct any defect.
* htmLawed is expected to work with input texts in ASCII-compatible single byte encodings such as national variants of ASCII (like ISO-646-DE/German of the ISO 646 standard), extended ASCII variants (like ISO 8859-9/Turkish of the ISO 8859/ISO Latin standard), ISO 8859-based Windows variants (like Windows 1252), EBCDIC, Shift JIS (Japanese), GB-Roman (Chinese), and KS-Roman (Korean). It should also properly handle texts with variable byte encodings like UTF-7 (Unicode) and UTF-8 (Unicode). However, htmLawed may mangle input texts with double byte encodings like UTF-16 (Unicode), JIS X 0208:1997 (Japanese) and KS X 1001:1992 (Korean), or the UTF-32 (Unicode) quadruple byte encoding. If an input text has such an encoding, administrators can use PHP's iconv:- http://php.net/manual/en/book.iconv.php functions, or some other means, to convert text to UTF-8 before passing it to htmLawed.
* Like any script using PHP's PCRE regex functions, PHP setup-specific low PCRE limit values can cause htmLawed to at least partially fail with very long input texts.
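Several of the work-arounds above rely on a custom 'hook_tag' function. As a rough sketch of the URL-altering case, the function below upgrades 'http' links to 'https'. It assumes the calling convention described in section:- #3.4.9, i.e., that the function receives the element name and an array of attribute name-value pairs (or no second argument for a closing tag) and returns the complete tag string. The name 'my_https_hook()' is hypothetical:

```php
// Hypothetical hook function; assumes htmLawed calls it as
// my_https_hook($element, $attribute_array) for opening tags and as
// my_https_hook($element) for closing tags (see section 3.4.9).
function my_https_hook($element, $attribute_array = 0){
  // No attribute array: a closing tag is being handled
  if(!is_array($attribute_array)){
    return "</$element>";
  }
  // Upgrade the scheme of 'href' values in 'a' elements
  if($element == 'a' && isset($attribute_array['href'])){
    $attribute_array['href'] = preg_replace('`^http://`i', 'https://', $attribute_array['href']);
  }
  // Rebuild the opening tag from the attribute name-value pairs
  $string = '';
  foreach($attribute_array as $k=>$v){
    $string .= " {$k}=\"{$v}\"";
  }
  // Empty elements get the element-closing slash
  static $empty_elements = array('area'=>1, 'br'=>1, 'col'=>1, 'embed'=>1, 'hr'=>1, 'img'=>1, 'input'=>1, 'isindex'=>1, 'param'=>1);
  return "<{$element}{$string}". (isset($empty_elements[$element]) ? ' /' : ''). '>';
}
```

Under the same assumption, the function would be enabled by setting '$config["hook_tag"]' to 'my_https_hook' before calling htmLawed. The same skeleton can carry other per-tag checks, such as finer attribute value validation.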
-- 2.9 Examples of usage -------------------------------------------o
Safest, allowing only `safe` HTML markup --
$config = array('safe'=>1);
$out = htmLawed($in, $config);
Simplest, allowing all valid HTML markup except 'javascript:' --
$out = htmLawed($in);
Allowing all valid HTML markup including 'javascript:' --
$config = array('schemes'=>'*:*');
$out = htmLawed($in, $config);
Allowing only 'safe' HTML and the elements 'a', 'em', and 'strong' --
$config = array('safe'=>1, 'elements'=>'a, em, strong');
$out = htmLawed($in, $config);
Not allowing elements 'script' and 'object' --
$config = array('elements'=>'* -script -object');
$out = htmLawed($in, $config);
Not allowing attributes 'id' and 'style' --
$config = array('deny_attribute'=>'id, style');
$out = htmLawed($in, $config);
Permitting only attributes 'title' and 'href' --
$config = array('deny_attribute'=>'* -title -href');
$out = htmLawed($in, $config);
Remove bad/disallowed tags altogether instead of converting them to entities --
$config = array('keep_bad'=>0);
$out = htmLawed($in, $config);
Allowing attribute 'title' only in 'a' and not allowing attributes 'id', 'style', or scriptable `on*` attributes like 'onclick' --
$config = array('deny_attribute'=>'title, id, style, on*');
$spec = 'a=title';
$out = htmLawed($in, $config, $spec);
Allowing a custom attribute, 'vFlag', in 'img' and permitting custom use of the standard attribute, 'rel', in 'input' --
$spec = 'img=vFlag; input=rel';
$out = htmLawed($in, $config, $spec);
Some case-studies are presented below.
*1.* A blog administrator wants to allow only 'a', 'em', 'strike', 'strong' and 'u' in comments, but needs 'strike' and 'u' transformed to 'span' for better XHTML 1-strict compliance, and, he wants the 'a' links to point only to 'http' or 'https' resources:
$processed = htmLawed($in, array('elements'=>'a, em, strike, strong, u', 'make_tag_strict'=>1, 'safe'=>1, 'schemes'=>'*:http, https'), 'a=href');
*2.* An author uses a custom-made web application to load content on his web-site. He is the only one using that application and the content he generates has all types of HTML, including scripts. The web application uses htmLawed primarily as a tool to correct errors that creep in while writing HTML and to take care of the occasional `bad` characters in copy-paste text introduced by Microsoft Office. The web application provides a preview before submitted input is added to the content. For the previewing process, htmLawed is set up as follows:
$processed = htmLawed($in, array('css_expression'=>1, 'keep_bad'=>1, 'make_tag_strict'=>1, 'schemes'=>'*:*', 'valid_xhtml'=>1));
For the final submission process, 'keep_bad' is set to '6'. A value of '1' for the preview process allows the author to note and correct any HTML mistake without losing any of the typed text.
*3.* A data-miner is scraping information in a specific table of similar web-pages and is collating the data rows, and uses htmLawed to reduce unnecessary markup and white-spaces:
$processed = htmLawed($in, array('elements'=>'tr, td', 'tidy'=>-1), 'tr, td =');
== 3 Details =====================================================oo
-- 3.1 Invalid/dangerous characters --------------------------------
Valid characters (more correctly, their code-points) in HTML or XML are, hexadecimally, '9', 'a', 'd', '20' to 'd7ff', and 'e000' to '10ffff', except 'fffe' and 'ffff' (decimally, '9', '10', '13', '32' to '55295', and '57344' to '1114111', except '65534' and '65535'). htmLawed removes the invalid characters '0' to '8', 'b', 'c', and 'e' to '1f'.
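The removal of the single-byte invalid characters can be pictured with the following sketch. It is not htmLawed's own code, only an illustration of the byte ranges listed above; the function name is hypothetical:

```php
// Illustration only: strip bytes 0x00-0x08, 0x0b, 0x0c and 0x0e-0x1f,
// keeping tab (0x09), line-feed (0x0a) and carriage-return (0x0d).
function strip_invalid_chars($text){
  return preg_replace('/[\x00-\x08\x0b\x0c\x0e-\x1f]/', '', $text);
}

$clean = strip_invalid_chars("a\x00b\x07c\td\ne\x1ff");
```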
Because of PHP's poor native support for multi-byte characters, htmLawed cannot check for the remaining invalid code-points. However, for various reasons, it is very unlikely for any of those characters to be in the input.
Characters that are discouraged (see section:- #5.1) but not invalid are not removed by htmLawed.
It (function 'hl_tag()') also replaces the potentially dangerous (in some Mozilla [Firefox] and Opera browsers) soft-hyphen character (code-point, hexadecimally, 'ad', or decimally, '173') in attribute values with spaces. Where required, the characters '<', '>', '&', and '"' are converted to entities.
With '$config["clean_ms_char"]' set as '1' or '2', many of the discouraged characters (decimal code-points '127' to '159' except '133') that many Microsoft applications incorrectly use (as per the 'Windows 1252' ['Cp-1252'] or a similar encoding system), and the character for decimal code-point '133', are converted to appropriate decimal numerical entities (or removed for a few cases)-- see appendix in section:- #5.4. This can help avoid some display issues arising from copying-pasting of content.
With '$config["clean_ms_char"]' set as '2', characters for the hexadecimal code-points '82', '91', and '92' (for special single-quotes), and '84', '93', and '94' (for special double-quotes) are converted to ordinary single and double quotes respectively and not to entities.
The characters are replaced with entities or ordinary characters, and not with the encoded characters that those entities refer to, so that this task stays independent of the character-encoding of the input text.
The '$config["clean_ms_char"]' parameter should not be used if authors do not copy-paste Microsoft-created text, or if the input text is not believed to use the 'Windows 1252' ('Cp-1252') or a similar encoding like 'Cp-1251' (otherwise, for example when UTF-8 encoding is in use, Japanese or Korean characters can get mangled). Further, the input form and the web-pages displaying it or its content should have the character encoding appropriately marked-up.
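The effect of the two 'clean_ms_char' settings on the special quote characters can be sketched as below. The table covers only the six quote bytes named above, the function name is hypothetical, and the full htmLawed mapping (see appendix in section:- #5.4) is more extensive:

```php
// Illustration only: map six Windows-1252 quote bytes the way
// 'clean_ms_char' is described to work. $mode 1 gives decimal numeric
// entities; $mode 2 gives ordinary quote characters.
// As with the real parameter, this must not be applied to UTF-8 text,
// where these byte values occur inside multi-byte characters.
function clean_ms_quotes($text, $mode = 1){
  $entity = array(
    "\x82"=>'&#8218;', "\x91"=>'&#8216;', "\x92"=>'&#8217;', // single quotes
    "\x84"=>'&#8222;', "\x93"=>'&#8220;', "\x94"=>'&#8221;', // double quotes
  );
  $plain = array(
    "\x82"=>"'", "\x91"=>"'", "\x92"=>"'",
    "\x84"=>'"', "\x93"=>'"', "\x94"=>'"',
  );
  return strtr($text, $mode == 2 ? $plain : $entity);
}
```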
-- 3.2 Character references/entities ------------------------------o
Valid character entities take the form '&*;' where '*' is '#x' followed by a hexadecimal number (hexadecimal numeric entity; like '&#xA0;' for non-breaking space), or alphanumeric like 'gt' (external or named entity; like '&nbsp;' for non-breaking space), or '#' followed by a number (decimal numeric entity; like '&#160;' for non-breaking space). Character entities referring to the soft-hyphen character (the '&shy;' or '\xad' character; hexadecimal code-point 'ad' [decimal '173']) in URL-accepting attribute values are always replaced with spaces; soft-hyphens in attribute values introduce vulnerabilities in some older versions of the Opera and Mozilla [Firefox] browsers.
htmLawed (function 'hl_ent()'):
* Neutralizes entities with multiple leading zeroes or missing semi-colons (potentially dangerous)
* Lowercases the 'X' (for XML-compliance) and 'A-F' of hexadecimal numeric entities
* Neutralizes entities referring to characters that are HTML-invalid (see section:- #3.1)
* Neutralizes entities referring to characters that are HTML-discouraged (code-points, hexadecimally, '7f' to '84', '86' to '9f', and 'fdd0' to 'fddf', or decimally, '127' to '132', '134' to '159', and '64976' to '64991'). Entities referring to the remaining discouraged characters (see section:- #5.1 for a full list) are let through.
* Neutralizes named entities that are not in the specs.
* Optionally converts valid HTML-specific named entities except '&gt;', '&lt;', '&quot;', and '&amp;' to decimal numeric ones (hexadecimal if '$config["hexdec_entity"]' is '2') for generic XML-compliance. For this, '$config["named_entity"]' should be '1'.
* Optionally converts hexadecimal numeric entities to the more widely supported decimal ones. For this, '$config["hexdec_entity"]' should be '0'.
* Optionally converts decimal numeric entities to the hexadecimal ones. For this, '$config["hexdec_entity"]' should be '2'.
`Neutralization` refers to the `entitification` of '&' to '&amp;'.
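Two of the operations listed above, hexadecimal-to-decimal conversion and neutralization, can be sketched as follows. This is not the code of 'hl_ent()'; the function names are hypothetical and only the simplest cases are handled:

```php
// Illustration only: convert hexadecimal numeric entities to decimal ones,
// as described for $config["hexdec_entity"] = 0.
function hex_entities_to_dec($text){
  return preg_replace_callback(
    '/&#x([0-9a-f]+);/i',
    function($m){ return '&#'. hexdec($m[1]). ';'; },
    $text
  );
}

// Illustration only: neutralize an entity by entitifying its leading '&'.
function neutralize_entity($entity){
  return '&amp;'. substr($entity, 1);
}

$dec = hex_entities_to_dec('A&#xA0;B&#X3c;'); // 'A&#160;B&#60;'
$neutral = neutralize_entity('&TILDE;');     // '&amp;TILDE;'
```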
*Note*: htmLawed does not convert entities to the actual characters represented by them; one can pass the htmLawed output through PHP's 'html_entity_decode' function:- http://www.php.net/html_entity_decode for that.
*Note*: If '$config["and_mark"]' is set, and set to a value other than '0', then the '&' characters in the original input are replaced with the control character for the hexadecimal code-point '6' ('\x06'; '&' characters introduced by htmLawed, e.g., after converting '<' to '&lt;', are not affected). This allows one to distinguish, say, an '&gt;' introduced by htmLawed and an '&gt;' put in by the input writer, and can be helpful in further processing of the htmLawed-processed text (e.g., to identify the character sequence 'o(&gt;&lt;)o' to generate an emoticon image). When this feature is active, admins should ensure that the htmLawed output is not directly used in web pages or XML documents as the presence of the '\x06' can break documents. Before use in such documents, and preferably before any storage, any remaining '\x06' should be changed back to '&', e.g., with:
$final = str_replace("\x06", '&', $prelim);
Also, see section:- #3.9.
-- 3.3 HTML elements ----------------------------------------------o
htmLawed can be configured to allow only certain HTML elements (tags) in the input. Disallowed elements (just tag-content, and not element-content), based on '$config["keep_bad"]', are either `neutralized` (converted to plain text by entitification of '<' and '>') or removed.
E.g., with only 'em' permitted:
Input:
<em>My</em> website is <a href="http://a.com">a.com</a>.
Output, with '$config["keep_bad"] = 0':
<em>My</em> website is a.com.
Output, with '$config["keep_bad"]' not '0':
<em>My</em> website is &lt;a href=""&gt;a.com&lt;/a&gt;.
See section:- #3.3.3 for differences between the various non-zero '$config["keep_bad"]' values.
htmLawed by default permits these 86 elements:
a, abbr, acronym, address, applet, area, b, bdo, big, blockquote, br, button, caption, center, cite, code, col, colgroup, dd, del, dfn, dir, div, dl, dt, em, embed, fieldset, font, form, h1, h2, h3, h4, h5, h6, hr, i, iframe, img, input, ins, isindex, kbd, label, legend, li, map, menu, noscript, object, ol, optgroup, option, p, param, pre, q, rb, rbc, rp, rt, rtc, ruby, s, samp, script, select, small, span, strike, strong, sub, sup, table, tbody, td, textarea, tfoot, th, thead, tr, tt, u, ul, var
Except for 'embed' (included because of its wide-spread use) and the Ruby elements ('rb', 'rbc', 'rp', 'rt', 'rtc', 'ruby'; part of XHTML 1.1), these are all the elements in the HTML 4/XHTML 1 specs. Strict-specific specs. exclude 'center', 'dir', 'font', 'isindex', 'menu', 's', 'strike', and 'u'.
With '$config["safe"] = 1', the default set will exclude 'applet', 'embed', 'iframe', 'object' and 'script'; see section:- #3.6.
When '$config["elements"]', which specifies allowed elements, is `properly` defined, and neither empty nor set to '0' or '*', the default set is not used. To have elements added to or removed from the default set, a '+/-' notation is used. E.g., '*-script-object' implies that only 'script' and 'object' are disallowed, whereas '*+noembed' means that 'noembed' is also allowed. Elements can also be specified as comma separated names. E.g., 'a, b, i' means only 'a', 'b' and 'i' are permitted. In this notation, '*', '+' and '-' have no significance and can actually cause a mis-reading.
Some more examples of '$config["elements"]' values indicating permitted elements (note that empty spaces are liberally allowed for clarity):
* 'a, blockquote, code, em, strong' -- only 'a', 'blockquote', 'code', 'em', and 'strong'
* '*-script' -- all excluding 'script'
* '* -center -dir -font -isindex -menu -s -strike -u' -- only XHTML-Strict elements
* '*+noembed-script' -- all including 'noembed' excluding 'script'
Some mis-usages (and the resulting permitted elements) that can be avoided:
* '-*' -- none; instead of htmLawed, one might just use, e.g., the 'htmlspecialchars()' PHP function
* '*, -script' -- all except 'script'; admin probably meant '*-script'
* '-*, a, em, strong' -- all; admin probably meant 'a, em, strong'
* '*' -- all; admin need not have set 'elements'
* '*-form+form' -- all; a '+' will always over-ride any '-'
* '*, noembed' -- only 'noembed'; admin probably meant '*+noembed'
* 'a, +b, i' -- only 'a' and 'i'; admin probably meant 'a, b, i'
Basically, when using the '+/-' notation, commas (',') should not be used, and vice versa, and '*' should be used with the former but not the latter.
*Note*: Even if an element that is not in the default set is allowed through '$config["elements"]', like 'noembed' in the last example, it will eventually be removed during tag balancing unless such balancing is turned off ('$config["balance"]' set to '0'). Currently, the only way around this, which actually is simple, is to edit the various arrays in the function 'hl_bal()' to accommodate the element and its nesting properties.
*A possible second way to specify allowed elements* is to set '$config["parent"]' to an element name that supposedly will hold the input, and to set '$config["balance"]' to '1'. During tag balancing (see section:- #3.3.3), all elements that cannot legally nest inside the parent element will be removed. The parent element is auto-reset to 'div' if '$config["parent"]' is empty, 'body', or an element not in htmLawed's default set of 86 elements.
`Tag transformation` is possible for improving XHTML-Strict compliance -- most of the deprecated elements are removed or converted to valid XHTML-Strict ones; see section:- #3.3.2.
.. 3.3.1 Handling of comments and CDATA sections ...................
'CDATA' sections have the format '<![CDATA[...anything but not "]]>"...]]>', and HTML comments, '<!--...anything but not "-->"... -->'. Neither HTML comments nor 'CDATA' sections can reside inside tags. HTML comments can exist anywhere else, but 'CDATA' sections can exist only where plain text is allowed (e.g., immediately inside 'td' element content but not immediately inside 'tr' element content).
htmLawed (function 'hl_cmtcd()') handles HTML comments or 'CDATA' sections depending on the values of '$config["comment"]' or '$config["cdata"]'. If '0', such markup is not looked for and the text is processed like plain text. If '1', it is removed completely. If '2', it is preserved but any '<', '>' and '&' inside are changed to entities. If '3', they are left as such.
Note that for the last two cases, HTML comments and 'CDATA' sections will always be removed from tag content (function 'hl_tag()').
Examples:
Input:
<!-- home link --><a href="home.htm"><![CDATA[x=&y]]>Home</a>
Output ('$config["comment"] = 0, $config["cdata"] = 2'):
&lt;!-- home link --&gt;<a href="home.htm"><![CDATA[x=&amp;y]]>Home</a>
Output ('$config["comment"] = 1, $config["cdata"] = 2'):
<a href="home.htm"><![CDATA[x=&amp;y]]>Home</a>
Output ('$config["comment"] = 2, $config["cdata"] = 2'):
<!-- home link --><a href="home.htm"><![CDATA[x=&amp;y]]>Home</a>
Output ('$config["comment"] = 2, $config["cdata"] = 1'):
<!-- home link --><a href="home.htm">Home</a>
Output ('$config["comment"] = 3, $config["cdata"] = 3'):
<!-- home link --><a href="home.htm"><![CDATA[x=&y]]>Home</a>
For standard-compliance, comments are given the form '<!--comment -->', and any '--' in the content is made '-'.
When '$config["safe"] = 1', CDATA sections and comments are considered plain text unless '$config["comment"]' or '$config["cdata"]' is explicitly specified; see section:- #3.6.
.. 3.3.2 Tag-transformation for better XHTML-Strict ................o
If '$config["make_tag_strict"]' is set and not '0', the following non-XHTML-Strict elements (and attributes), even if admin-permitted, are mutated as indicated (element content remains intact; function 'hl_tag2()'):
* applet - (based on '$config["make_tag_strict"]', unchanged ('1') or removed ('2'))
* center - 'div style="text-align: center;"'
* dir - 'ul'
* embed - (based on '$config["make_tag_strict"]', unchanged ('1') or removed ('2'))
* font (face, size, color) - 'span style="font-family: ; font-size: ; color: ;"' (size transformation reference:- http://style.cleverchimp.com/font_size_intervals/altintervals.html)
* isindex - (based on '$config["make_tag_strict"]', unchanged ('1') or removed ('2'))
* menu - 'ul'
* s - 'span style="text-decoration: line-through;"'
* strike - 'span style="text-decoration: line-through;"'
* u - 'span style="text-decoration: underline;"'
For an element with a pre-existing 'style' attribute value, the extra style properties are appended.
Example input:
<center>
The PHP <s>software</s> script used for this <strike>web-page</strike> web-page is <font style="font-weight: bold " face=arial size='+3' color = "red ">htmLawedTest.php</font>, from <u style= 'color:green'>PHP Labware</u>.
</center>
The output:
<div style="text-align: center;">
The PHP <span style="text-decoration: line-through;">software</span> script used for this <span style="text-decoration: line-through;">web-page</span> web-page is <span style="font-weight: bold; font-family: arial; color: red; font-size: 200%;">htmLawedTest.php</span>, from <span style="color:green; text-decoration: underline;">PHP Labware</span>.
</div>
.. 3.3.3 Tag balancing and proper nesting .........................o
If '$config["balance"]' is set to '1', htmLawed (function 'hl_bal()') checks and corrects the input to have properly balanced tags and legal element content (i.e., any element nesting should be valid, and plain text may be present only in the content of elements that allow them).
Depending on the value of '$config["keep_bad"]' (see section:- #2.2 and section:- #3.3), illegal content may be removed or neutralized to plain text by converting < and > to entities:
'0' - remove; this option is available only to maintain Kses-compatibility and should not be used otherwise (see section:- #2.6)
'1' - neutralize tags and keep element content
'2' - remove tags but keep element content
'3' and '4' - like '1' and '2', but keep element content only if text ('pcdata') is valid in parent element as per specs
'5' and '6' - like '3' and '4', but line-breaks, tabs and spaces are left
Example input (disallowing the 'p' element):
<*> Pseudo-tags <*>
<xml>Non-HTML tag xml</xml>
<p>
Disallowed tag p
</p>
<ul>Bad<li>OK</li></ul>
The output with '$config["keep_bad"] = 1':
&lt;*&gt; Pseudo-tags &lt;*&gt;
&lt;xml&gt;Non-HTML tag xml&lt;/xml&gt;
&lt;p&gt;
Disallowed tag p
&lt;/p&gt;
<ul>Bad<li>OK</li></ul>
The output with '$config["keep_bad"] = 3':
&lt;*&gt; Pseudo-tags &lt;*&gt;
&lt;xml&gt;Non-HTML tag xml&lt;/xml&gt;
&lt;p&gt;
Disallowed tag p
&lt;/p&gt;
<ul><li>OK</li></ul>
The output with '$config["keep_bad"] = 6':
&lt;*&gt; Pseudo-tags &lt;*&gt;
Non-HTML tag xml
Disallowed tag p
<ul><li>OK</li></ul>
An option like '1' is useful, e.g., when a writer previews a submission, whereas one like '3' is useful before content is finalized and made available to all.
*Note:* In the example above, unlike '<*>', '<xml>' is considered a tag (even though there is no HTML element named 'xml'). Thus, the 'keep_bad' parameter's value affects '<xml>' but not '<*>'. In general, text matching the regular expression pattern '<(/?)([a-zA-Z][a-zA-Z1-6]*)([^>]*?)\s?>' is considered a tag (a phrase enclosed by the angle brackets '<' and '>' that starts, after an optional slash, with an alphanumeric word beginning with a letter...), and is subjected to the 'keep_bad' value.
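The outputs above can be reproduced with a short script along the following lines. This is only a minimal sketch: it assumes that 'htmLawed.php' is in the same directory as the script, and it uses '*-p' for '$config["elements"]' to disallow 'p' (see section:- #3.3 for the element-list syntax). The variable names are illustrative.

```php
<?php
// Assumption: htmLawed.php is in the same directory as this script.
$lib = __DIR__ . '/htmLawed.php';

// The example input used in this section.
$input = "<*> Pseudo-tags <*>\n<xml>Non-HTML tag xml</xml>\n<p>\nDisallowed tag p\n</p>\n<ul>Bad<li>OK</li></ul>";

// One configuration per 'keep_bad' value shown in the examples.
$configs = array();
foreach (array(1, 3, 6) as $keep_bad) {
    $configs[$keep_bad] = array(
        'balance'  => 1,        // tag balancing must be on
        'keep_bad' => $keep_bad,
        'elements' => '*-p',    // all default elements except 'p'
    );
}

if (is_file($lib)) {
    require_once $lib;
    foreach ($configs as $keep_bad => $config) {
        echo "keep_bad = $keep_bad:\n", htmLawed($input, $config), "\n\n";
    }
} else {
    echo "htmLawed.php not found; nothing to run.\n";
}
```

Running the script with each value side by side is a quick way to decide which 'keep_bad' setting suits a preview step and which suits final publication.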
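The pattern in the note can be tried directly with PHP's 'preg_match()'. The sketch below is independent of htmLawed; the helper name 'looks_like_tag()' and the '~' delimiters are choices made for this illustration only.

```php
<?php
// The tag-matching pattern quoted in the note, wrapped in '~' delimiters.
$pattern = '~<(/?)([a-zA-Z][a-zA-Z1-6]*)([^>]*?)\s?>~';

// Hypothetical helper: true when $text contains something the pattern
// treats as a tag.
function looks_like_tag($text, $pattern)
{
    return preg_match($pattern, $text) === 1;
}

$samples = array('<*>', '<xml>', '</p>', '<h1 class="x">', '< b>');
foreach ($samples as $sample) {
    echo str_pad($sample, 16), looks_like_tag($sample, $pattern) ? 'tag' : 'not a tag', "\n";
}
```

Here '<xml>', '</p>' and '<h1 class="x">' match, while '<*>' and '< b>' do not, because the character after '<' (or after '</') must be a letter.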
Nesting/content rules for each of the 86 elements in htmLawed's default set (see section:- #3.3) are defined in function 'hl_bal()'. This means that if a non-standard element besides 'embed' is being permitted through '$config["elements"]', the element's tag content will end up getting removed if '$config["balance"]' is set to '1'.
Plain text and/or certain elements nested inside 'blockquote', 'form', 'map' and 'noscript' need to be in block-level elements. This point is often missed during manual writing of HTML code. htmLawed attempts to address this during balancing. E.g., if the parent container is set as 'form', the input 'B:<input type="text" value="b" />C:<input type="text" value="c" />' is converted to '<div>B:<input type="text" value="b" />C:<input type="text" value="c" /></div>'.
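The 'form' case above can be tried with a script such as the one below. It is a sketch that assumes the parent container is set through '$config["parent"]' (see section:- #2.2) and that 'htmLawed.php' is in the same directory as the script.

```php
<?php
// Assumption: htmLawed.php is in the same directory as this script.
$lib = __DIR__ . '/htmLawed.php';

// Inline content that is meant to sit directly inside a 'form' element.
$input = 'B:<input type="text" value="b" />C:<input type="text" value="c" />';

$config = array(
    'balance' => 1,      // balancing is what applies the nesting rules
    'parent'  => 'form', // the element that will hold the filtered output
);

if (is_file($lib)) {
    require_once $lib;
    echo htmLawed($input, $config), "\n";
} else {
    echo "htmLawed.php not found; nothing to run.\n";
}
```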
-- 3.3.4 Elements requiring child elements ------------------------o
As per specs, the following elements require legal child elements nested inside them:
blockquote, dir, dl, form, map, menu, noscript, ol, optgroup, rbc, rtc, ruby, select, table, tbody, tfoot, thead, tr, ul
In some cases, the specs stipulate the number and/or the ordering of the child elements. A 'table' can have 0 or 1 'caption', 'tbody', 'tfoot', and 'thead', but they must be in this order: 'caption', 'thead', 'tfoot', 'tbody'.
htmLawed currently does not check for conformance to these rules. Note that any non-compliance in this regard will not introduce security vulnerabilities, crash browser applications, or affect the rendering of web-pages.
With '$config["direct_list_nest"]' set to '1', htmLawed will allow direct nesting of an 'ol' or 'ul' list within another 'ol' or 'ul' without requiring the child list to be within an 'li' of the parent list. While this is not standard-compliant, directly nested lists are rendered properly by almost all browsers. The parameter '$config["direct_list_nest"]' has no effect if tag-balancing (section:- #3.3.3) is turned off.
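To see the effect of 'direct_list_nest', the same input can be filtered with the parameter off and on. The script below is a sketch that assumes 'htmLawed.php' is in the same directory; the sample list is made up for illustration.

```php
<?php
// Assumption: htmLawed.php is in the same directory as this script.
$lib = __DIR__ . '/htmLawed.php';

// A 'ul' nested directly in another 'ul', i.e., not inside an 'li'.
$input = '<ul><li>One</li><ul><li>Nested</li></ul></ul>';

// Balancing stays on in both cases; only 'direct_list_nest' differs.
$configs = array(
    'off' => array('balance' => 1, 'direct_list_nest' => 0),
    'on'  => array('balance' => 1, 'direct_list_nest' => 1),
);

if (is_file($lib)) {
    require_once $lib;
    foreach ($configs as $label => $config) {
        echo "direct_list_nest $label: ", htmLawed($input, $config), "\n";
    }
} else {
    echo "htmLawed.php not found; nothing to run.\n";
}
```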
-- 3.3.5 Beautify or compact HTML ---------------------------------o
By default, htmLawed will neither `beautify` HTML code by formatting it with indentations, etc., nor make it compact by removing un-needed white-space. (It does always properly white-space tag content.)
As per the HTML standards, spaces, tabs and line-breaks in web-pages (except those inside 'pre' elements) are all considered equivalent, and referred to as `white-spaces`. Browser applications are supposed to consider contiguous white-spaces as just a single space, and to disregard white-spaces trailing opening tags or preceding closing tags. This white-space `normalization` allows the use of text/code beautifully formatted with indentations and line-spacings for readability. Such `pretty` HTML can, however, increase the size of web-pages, or make the extraction or scraping of plain text cumbersome.
With the '$config' parameter 'tidy', htmLawed can be used to beautify or compact the input text. Input with just plain text and no HTML markup is also subject to this. Besides 'pre', the 'script' and 'textarea' elements, CDATA sections, and HTML comments are not subjected to the tidying process.
To `compact`, use '$config["tidy"] = -1'; single instances or runs of white-spaces are replaced with a single space, and white-spaces trailing opening tags or preceding closing tags are removed.
To `beautify`, '$config["tidy"]' is set as '1', or for customized tidying, as a string like '2s2n'. The 's' or 't' character specifies the use of spaces or tabs for indentation. The first and third characters, any of the digits 0-9, specify the number of spaces or tabs per indentation, and any parental lead spacing (extra indenting of the whole block of input text). The 'r' and 'n' characters are used to specify line-break characters: 'n' for '\n' (Unix/Mac OS X line-breaks), 'rn' or 'nr' for '\r\n' (Windows/DOS line-breaks), or 'r' for '\r'.
The '$config["tidy"]' value of '1' is equivalent to '2s0n'. Other '$config["tidy"]' values are read loosely: a value of '4' is equivalent to '4s0n'; 't2', to '1t2n'; 's', to '2s0n'; '2TR', to '2t0r'; 'T1', to '1t1n'; 'nr3', to '3s0nr', and so on. Except in the indentations and line-spacings, runs of white-spaces are replaced with a single space during beautification.
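The loose reading of 'tidy' values can be illustrated with a small stand-alone function. This is not htmLawed's own parsing code; 'tidy_value_to_canonical()' is a hypothetical helper written only to reproduce the equivalences listed above, and it does not handle the compacting value '-1'.

```php
<?php
// Hypothetical helper, not part of htmLawed: expands a loosely written
// 'tidy' value into the four-part form used in the text (e.g., '2s0n').
function tidy_value_to_canonical($value)
{
    // The plain value 1 is documented as shorthand for '2s0n'.
    if ((string) $value === '1') {
        return '2s0n';
    }
    $w = strtolower((string) $value);

    // A 't' anywhere selects tabs; otherwise spaces are used.
    $type = strpos($w, 't') !== false ? 't' : 's';

    // A digit right after 's' or 't' is the parental lead spacing.
    $lead = preg_match('~[st](\d)~', $w, $m) ? $m[1] : '0';

    // A digit not preceded by 's' or 't' is the indentation size;
    // the fallback is 1 for tabs and 2 for spaces.
    $indent = preg_match('~(?<![st])\d~', $w, $m) ? $m[0] : ($type === 't' ? '1' : '2');

    // Line-break letters are kept in the order given; 'n' is the fallback.
    preg_match_all('~[rn]~', $w, $m);
    $break = implode('', array_unique($m[0]));
    if ($break === '') {
        $break = 'n';
    }

    return $indent . $type . $lead . $break;
}

foreach (array(1, '4', 't2', 's', '2TR', 'T1', 'nr3') as $value) {
    echo str_pad($value, 5), ' => ', tidy_value_to_canonical($value), "\n";
}
```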
Input formatting using '$config["tidy"]' is not recommended when input text has mixed markup (like HTML + PHP).
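A minimal sketch of both tidying modes follows. It assumes 'htmLawed.php' is in the same directory as the script; the sample markup and the array keys 'compact' and 'beautify' are illustrative only.

```php
<?php
// Assumption: htmLawed.php is in the same directory as this script.
$lib = __DIR__ . '/htmLawed.php';

// Sample markup with redundant white-space.
$input = "<div>  <p>Some   <b>bold</b>  text</p>\n\n  </div>";

$configs = array(
    'compact'  => array('tidy' => -1),     // collapse white-space
    'beautify' => array('tidy' => '2s2n'), // 2-space indents, lead of 2, \n breaks
);

if (is_file($lib)) {
    require_once $lib;
    foreach ($configs as $label => $config) {
        echo "$label:\n", htmLawed($input, $config), "\n\n";
    }
} else {
    echo "htmLawed.php not found; nothing to run.\n";
}
```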
-- 3.4 Attributes ------------------------------------------------oo
htmLawed will only permit attributes described in the HTML specs (including deprecated ones). It also permits some attributes for use with the 'embed' element (the non-standard 'embed' element is supported in htmLawed because of its widespread use), and the 'allowfullscreen' (in 'iframe', because of its widespread use), 'bordercolor' (in 'table', 'td' and 'tr', because of its widespread use), and 'xml:space' (valid only in XHTML 1.1) attributes. A list of such 112 attributes and the elements they are allowed in is in section:- #5.2. Using the '$spec' argument, htmLawed can be forced to permit custom, non-standard attributes as well as custom rules for standard attributes (section:- #2.3).
When '$config["deny_attribute"]' is not set, or set to '0', or empty ('""'), all of these attributes are permitted. Otherwise, '$config["deny_attribute"]' can be set as a list of comma-separated names of the denied attributes. 'on*' can be used to refer to the group of potentially dangerous, script-accepting attributes: 'onblur', 'onchange', 'onclick', 'ondblclick', 'onfocus', 'onkeydown', 'onkeypress', 'onkeyup', 'onmousedown', 'onmousemove', 'onmouseout', 'onmouseover', 'onmouseup', 'onreset', 'onselect' and 'onsubmit'.
Note that attributes specified in '$config["deny_attribute"]' are denied globally, for all elements. To deny attributes for only specific elements, '$spec' (see section:- #2.3) can be used. '$spec' can also be used to permit, for specific elements, an attribute that is otherwise denied through '$config["deny_attribute"]'.
With '$config["safe"] = 1' (section:- #3.6), the 'on*' attributes are automatically disallowed.
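The two ways of blocking the 'on*' attributes described above can be compared with a script like this one. It is a sketch that assumes 'htmLawed.php' is in the same directory; the sample link and the array keys are made up for illustration.

```php
<?php
// Assumption: htmLawed.php is in the same directory as this script.
$lib = __DIR__ . '/htmLawed.php';

// A link carrying a script-accepting attribute and a 'style' attribute.
$input = '<a href="#" onclick="alert(1)" style="color:red" title="t">link</a>';

$configs = array(
    // Deny the whole 'on*' group and 'style' globally, for all elements.
    'deny list' => array('deny_attribute' => 'on*,style'),
    // 'safe' mode disallows the 'on*' attributes automatically.
    'safe mode' => array('safe' => 1),
);

if (is_file($lib)) {
    require_once $lib;
    foreach ($configs as $label => $config) {
        echo "$label: ", htmLawed($input, $config), "\n";
    }
} else {
    echo "htmLawed.php not found; nothing to run.\n";
}
```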