StackOverflowException when loading pages with lots of nested <br> tags #415

Closed
MachineLearning666 opened this issue Oct 28, 2020 · 6 comments

Comments

@MachineLearning666

MachineLearning666 commented Oct 28, 2020

Description

A stack overflow occurs when loading a page that has lots of nested tags. I tried setting MaxDepthLevel=100, OptionMaxNestedChildNodes=50, OptionAutoCloseOnEnd=true, and OptionFixNestedTags=true, but it didn't help.

The exception is thrown by the call htmlDoc.LoadHtml(html);

try
{
    HtmlDocument.MaxDepthLevel = 100;
    htmlDoc = new HtmlDocument();

    // A stack overflow occurs if there are too many nested links
    htmlDoc.OptionMaxNestedChildNodes = 50;
    htmlDoc.OptionAutoCloseOnEnd = true;
    htmlDoc.OptionFixNestedTags = true;
    htmlDoc.LoadHtml(html); // the exception is thrown here

    // Log; disable later
    Console.WriteLine("load page successfully");
}
catch (Exception)
{
    return;
}
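
A note on the try/catch above: on .NET Framework 2.0 and later, a StackOverflowException cannot be caught by a catch block, and by default it terminates the process, so the `return` in the handler is never reached for this failure. One generic mitigation, sketched below, is to run the parse on a dedicated thread with a larger maximum stack size. `LargeStackLoader`, `LoadOnLargeStack`, and the 64 MB default are hypothetical names and values chosen for illustration; they are not part of HtmlAgilityPack, and whether a larger stack is enough for this particular page is untested.

```csharp
using System;
using System.Threading;
using HtmlAgilityPack;

static class LargeStackLoader
{
    // Hypothetical helper (not an HtmlAgilityPack API): parses HTML on a
    // worker thread whose maximum stack size is set explicitly.
    public static HtmlDocument LoadOnLargeStack(string html, int stackSizeBytes = 64 * 1024 * 1024)
    {
        HtmlDocument result = null;
        Exception error = null;

        var worker = new Thread(() =>
        {
            try
            {
                var doc = new HtmlDocument();
                doc.LoadHtml(html);
                result = doc;
            }
            catch (Exception ex)
            {
                // Only ordinary exceptions land here. A stack overflow that
                // exceeds even the larger stack still terminates the process.
                error = ex;
            }
        }, stackSizeBytes);

        worker.Start();
        worker.Join();

        if (error != null)
            throw new InvalidOperationException("Parsing failed on the worker thread.", error);

        return result;
    }
}
```

Usage would look like `var htmlDoc = LargeStackLoader.LoadOnLargeStack(html);`. This only raises the ceiling; it does not remove the deep recursion, so input that must be handled robustly is safer to parse in a separate process that is allowed to crash.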

Exception

StackOverflow

Exception message:
Stack trace:
'InfoExtractor.exe' (CLR v4.0.30319: DefaultDomain): Loaded 'C:\WINDOWS\Microsoft.Net\assembly\GAC_32\mscorlib\v4.0_4.0.0.0__b77a5c561934e089\mscorlib.dll'. Skipped loading symbols. Module is optimized and the debugger option 'Just My Code' is enabled.
'InfoExtractor.exe' (CLR v4.0.30319: DefaultDomain): Loaded 'D:\ListNavRepo\ListNav\InfoExtractor\bin\Debug\InfoExtractor.exe'. Symbols loaded.
'InfoExtractor.exe' (CLR v4.0.30319: InfoExtractor.exe): Loaded 'C:\WINDOWS\Microsoft.Net\assembly\GAC_MSIL\System\v4.0_4.0.0.0__b77a5c561934e089\System.dll'. Skipped loading symbols. Module is optimized and the debugger option 'Just My Code' is enabled.
'InfoExtractor.exe' (CLR v4.0.30319: InfoExtractor.exe): Loaded 'D:\ListNavRepo\ListNav\InfoExtractor\bin\Debug\HtmlAgilityPack.dll'. Skipped loading symbols. Module is optimized and the debugger option 'Just My Code' is enabled.
'InfoExtractor.exe' (CLR v4.0.30319: InfoExtractor.exe): Loaded 'C:\WINDOWS\Microsoft.Net\assembly\GAC_MSIL\System.Xml\v4.0_4.0.0.0__b77a5c561934e089\System.Xml.dll'. Skipped loading symbols. Module is optimized and the debugger option 'Just My Code' is enabled.
'InfoExtractor.exe' (CLR v4.0.30319: InfoExtractor.exe): Loaded 'C:\WINDOWS\Microsoft.Net\assembly\GAC_MSIL\System.Core\v4.0_4.0.0.0__b77a5c561934e089\System.Core.dll'. Skipped loading symbols. Module is optimized and the debugger option 'Just My Code' is enabled.
An unhandled exception of type 'System.StackOverflowException' occurred in HtmlAgilityPack.dll




### Fiddle or Project
It's the page http://ca800.com/biz/d_1_1nrusru51nrkj.html
It contains lots of <br> nodes.

### Further technical details
- HAP version: HtmlAgilityPack.1.11.16
- NET version (net472, netcore, etc.): net4.6.1
@JonathanMagnan
Member

Hello @MachineLearning666 ,

I'm not sure if that's something we fixed in a more recent version, but I tried to load this page and everything seems to work as expected:

var doc = new HtmlDocument();
doc.LoadHtml(File.ReadAllText(@"C:\Users\Jonathan\Desktop\br_issue.html"));
var html = doc.DocumentNode.InnerHtml;

Could you try it with the latest version?

Best Regards,

Jon


@MachineLearning666
Author

Hi @JonathanMagnan, thanks for the quick response. I think you may have used view-source:http://ca800.com/biz/d_1_1nrusru51nrkj.html to get the page's HTML content (~180 KB). I tried that as well, and it works as expected in all old and new versions. The page I tested came from a crawler and may be an older version of the page; its size is about 800 KB.
I have attached the page from the crawler. It also fails in the latest HAP.
ErrorPage2.txt

@JonathanMagnan
Member

Thank you @MachineLearning666 ,

We can now successfully reproduce it.

We will look at it.

@JonathanMagnan
Member

Hello @MachineLearning666 ,

Version v1.11.28 has been released.

Could you try it and let us know if the br issue is fully fixed?

Best Regards,

Jon

@MachineLearning666
Author

Hi @JonathanMagnan , the issue is fully fixed. Thanks a lot.

@JonathanMagnan
Member

Awesome @MachineLearning666 !

Don't hesitate to contact us with any questions, issues or feedback.

Best regards,

Jon
