Commit c2e12a9

Avoiding CDATA nonsense
1 parent 50571ac commit c2e12a9

2 files changed, +3 -5 lines changed

requirements.txt

+1 -1
@@ -17,7 +17,7 @@ charset-normalizer==3.3.2
 dateparser==1.1.6
 ebbe==1.13.2
 json5==0.9.11
-lxml>=4.9.2,<5
+lxml>=4.9.2
 nanoid==2.0.0
 playwright==1.35.0
 playwright_stealth==1.0.5

test/scraper_test.py

+2 -4
@@ -97,6 +97,8 @@
 </table>
 """
 
+
+# NOTE: CDATA is handled very differently depending on lxml & bs4 versions
 THE_WORST_HTML = """
 <div>Some text isn't
 it?
@@ -136,8 +138,6 @@
 <li>Other</li>
 <li>Again</li>
 </ol>
-<p>
-<![CDATA[some very interesting stuff]]></p>
 <p>
 This is <span>a large span </span>
 with something else over <strong>here</strong>.
@@ -997,8 +997,6 @@ def clean(t):
 Other
 Again
 
-some very interesting stuff
-
 This is a large span with something else over here.
 
 Hello gorgeous!
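
The NOTE added to test/scraper_test.py says CDATA handling depends on the lxml and bs4 versions, and this commit sidesteps the problem by dropping the CDATA paragraph from the fixture. A different way to get version-independent output, sketched here only as an illustration, would be to unwrap CDATA sections into plain escaped text before the markup reaches any parser. The helper name `unwrap_cdata` is hypothetical and does not exist in the repository; the sketch uses only the standard library and makes no claim about how lxml or bs4 behave.

```python
import re
from html import escape

# Matches one CDATA section and captures its payload.
# Non-greedy so adjacent sections are handled separately; DOTALL so the
# payload may span several lines.
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def unwrap_cdata(html: str) -> str:
    """Replace each CDATA section with its HTML-escaped payload.

    Hypothetical helper, not part of the repository. After this step the
    markup contains no CDATA sections, so the extracted text should no
    longer depend on how a given parser version treats them.
    """
    return CDATA_RE.sub(lambda m: escape(m.group(1), quote=False), html)


if __name__ == "__main__":
    sample = "<p>\n<![CDATA[some very interesting stuff]]></p>"
    print(unwrap_cdata(sample))
```

Escaping the payload keeps characters such as `<` and `&` from being reinterpreted as markup once the CDATA wrapper is gone. Whether this trade-off is preferable to simply removing CDATA from the test fixture, as the commit does, depends on whether real scraped pages are expected to contain it.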
