stackoverflow exception when load pages has lots of nested <br> tags #415

MachineLearning666 · 2020-10-28T03:07:20Z

Description

A StackOverflowException is thrown when loading a page that has many nested tags. I tried setting MaxDepthLevel = 100, OptionMaxNestedChildNodes = 50, OptionAutoCloseOnEnd = true, and OptionFixNestedTags = true, but none of these helped.

The exception is thrown by the call htmlDoc.LoadHtml(html);

try
{
    HtmlDocument.MaxDepthLevel = 100;
    htmlDoc = new HtmlDocument();

    // The load overflows the stack when there are too many nested links.
    htmlDoc.OptionMaxNestedChildNodes = 50;
    htmlDoc.OptionAutoCloseOnEnd = true;
    htmlDoc.OptionFixNestedTags = true;

    // The StackOverflowException is raised by this call.
    htmlDoc.LoadHtml(html);

    // Log; disable later.
    Console.WriteLine("load page successfully");
}
catch (Exception)
{
    return;
}

Exception

StackOverflow

Exception message:
Stack trace:
'InfoExtractor.exe' (CLR v4.0.30319: DefaultDomain): Loaded 'C:\WINDOWS\Microsoft.Net\assembly\GAC_32\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll'. Skipped loading symbols. Module is optimized and the debugger option 'Just My Code' is enabled.
'InfoExtractor.exe' (CLR v4.0.30319: DefaultDomain): Loaded 'D:\ListNavRepo\ListNav\InfoExtractor\bin\Debug\InfoExtractor.exe'. Symbols loaded.
'InfoExtractor.exe' (CLR v4.0.30319: InfoExtractor.exe): Loaded 'C:\WINDOWS\Microsoft.Net\assembly\GAC_MSIL\System\v4.0_4.0.0.0__b77a5c561934e089\System.dll'. Skipped loading symbols. Module is optimized and the debugger option 'Just My Code' is enabled.
'InfoExtractor.exe' (CLR v4.0.30319: InfoExtractor.exe): Loaded 'D:\ListNavRepo\ListNav\InfoExtractor\bin\Debug\HtmlAgilityPack.dll'. Skipped loading symbols. Module is optimized and the debugger option 'Just My Code' is enabled.
'InfoExtractor.exe' (CLR v4.0.30319: InfoExtractor.exe): Loaded 'C:\WINDOWS\Microsoft.Net\assembly\GAC_MSIL\System.Xml\v4.0_4.0.0.0__b77a5c561934e089\System.Xml.dll'. Skipped loading symbols. Module is optimized and the debugger option 'Just My Code' is enabled.
'InfoExtractor.exe' (CLR v4.0.30319: InfoExtractor.exe): Loaded 'C:\WINDOWS\Microsoft.Net\assembly\GAC_MSIL\System.Core\v4.0_4.0.0.0__b77a5c561934e089\System.Core.dll'. Skipped loading symbols. Module is optimized and the debugger option 'Just My Code' is enabled.
An unhandled exception of type 'System.StackOverflowException' occurred in HtmlAgilityPack.dll
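
Note that the `catch (Exception)` in the snippet above cannot intercept this error: since .NET Framework 2.0, a StackOverflowException cannot be caught with try/catch and terminates the process by default. Until a library fix is available, one common workaround is to run the deeply recursive call on a thread created with a larger stack. The following is only a minimal sketch of that idea, not the fix shipped by HtmlAgilityPack; `LargeStackRunner`, `Run`, and `Depth` are hypothetical names, and `Depth` is merely a stand-in for a recursive parser.

```csharp
using System;
using System.Threading;

public static class LargeStackRunner
{
    // Runs `work` on a dedicated thread whose maximum stack size is
    // `stackBytes` and returns its result. Ordinary exceptions are
    // captured and rethrown, wrapped, on the calling thread.
    // A stack overflow still kills the process; the larger stack only
    // makes it less likely to happen.
    public static T Run<T>(Func<T> work, int stackBytes)
    {
        T result = default(T);
        Exception error = null;

        var thread = new Thread(() =>
        {
            try { result = work(); }
            catch (Exception ex) { error = ex; }
        }, stackBytes);

        thread.Start();
        thread.Join();

        if (error != null)
        {
            throw new InvalidOperationException(
                "Work failed on the large-stack thread.", error);
        }
        return result;
    }

    // Hypothetical stand-in for a recursive parser:
    // recurses once per nesting level and returns the depth reached.
    public static int Depth(int levels)
    {
        return levels == 0 ? 0 : 1 + Depth(levels - 1);
    }
}
```

With the code from the description, the call would look like `LargeStackRunner.Run(() => { var d = new HtmlDocument(); d.LoadHtml(html); return d; }, 64 * 1024 * 1024)`. The 64 MB figure is an arbitrary assumption; this only postpones the overflow for deeper input and does not remove the underlying recursion.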




### Fiddle or Project
The page is http://ca800.com/biz/d_1_1nrusru51nrkj.html
It contains many <br> nodes.

### Further technical details
- HAP version: HtmlAgilityPack.1.11.16
- .NET version (net472, netcore, etc.): net4.6.1


JonathanMagnan · 2020-10-28T14:28:49Z

Hello @MachineLearning666 ,

I'm not sure whether this is something we fixed in a more recent version, but I tried to load this page and everything seems to work as expected:

var doc = new HtmlDocument();
doc.LoadHtml(File.ReadAllText(@"C:\Users\Jonathan\Desktop\br_issue.html"));
var html = doc.DocumentNode.InnerHtml;

Could you try it with the latest version?

Best Regards,

Jon

MachineLearning666 · 2020-10-29T02:01:19Z

Hi @JonathanMagnan, thanks for the quick response. I think you may have used view-source:http://ca800.com/biz/d_1_1nrusru51nrkj.html to get the page's HTML content (~180 KB). I tried that, and it works as expected in all old and new versions. The page I tested came from a crawler. It may be an older version of the page; its size is about 800 KB.
I attached the page from the crawler. It also fails in the latest HAP.
ErrorPage2.txt

JonathanMagnan · 2020-10-29T14:49:15Z

Thank you @MachineLearning666 ,

We can now successfully reproduce it.

We will look at it.

JonathanMagnan · 2020-11-11T01:03:59Z

Hello @MachineLearning666 ,

Version v1.11.28 has been released.

Could you try it and let us know if the br issue is fully fixed?

Best Regards,

Jon

MachineLearning666 · 2020-11-17T03:29:44Z

Hi @JonathanMagnan , the issue is fully fixed. Thanks a lot.

JonathanMagnan · 2020-11-17T11:50:38Z

Awesome @MachineLearning666 !

Don't hesitate to contact us with any questions, issues or feedback.

Best regards,

Jon

JonathanMagnan closed this as completed Nov 17, 2020
