Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataFrame] LoadCsv throws IndexOutOfRangeException #5654

Open
luisquintanilla opened this issue May 6, 2020 · 6 comments
Open

[DataFrame] LoadCsv throws IndexOutOfRangeException #5654

luisquintanilla opened this issue May 6, 2020 · 6 comments
Assignees
Labels
Microsoft.Data.Analysis All DataFrame related issues and PRs

Comments

@luisquintanilla
Copy link
Contributor

Using the following dataset: https://www.kaggle.com/austinreese/craigslist-carstrucks-data

The dataset itself is large (1.4 GB) and sparse.

Using the following code:

var df = DataFrame.LoadCsv("vehicles.csv");

Throws the following exception:

System.IndexOutOfRangeException: 'Index was outside the bounds of the array.'

Here is the stacktrace:

   at Microsoft.Data.Analysis.DataFrame.GetColumnName(String[] columnNames, Int32 columnIndex)
   at Microsoft.Data.Analysis.DataFrame.CreateColumn(Type kind, String[] columnNames, Int32 columnIndex)
   at Microsoft.Data.Analysis.DataFrame.LoadCsv(Stream csvStream, Char separator, Boolean header, String[] columnNames, Type[] dataTypes, Int64 numberOfRowsToRead, Int32 guessRows, Boolean addIndexColumn, Encoding encoding)
   at Microsoft.Data.Analysis.DataFrame.LoadCsv(String filename, Char separator, Boolean header, String[] columnNames, Type[] dataTypes, Int32 numRows, Int32 guessRows, Boolean addIndexColumn, Encoding encoding)
   at ConsoleApp.Program.Main(String[] args) in C:\Users\luquinta.REDMOND\Development\UsedCarCraigslistSample\ConsoleApp\Program.cs:line 16
@luisquintanilla
Copy link
Contributor Author

Using similar functionality in Pandas does not throw these errors and the dataset loads successfully.

image

VehiclesEDA.zip

@pgovind pgovind self-assigned this May 6, 2020
@pgovind
Copy link

pgovind commented May 6, 2020

I looked at this locally and found the issue. The description column has many ,s inside it, so we're not able to parse the number of columns correctly. Instead of detecting 25 columns, we detect 150+ columns. Let me explore adding a new IO csproj to help with LoadCsv bugs such as this one

@luisquintanilla
Copy link
Contributor Author

Great! Thanks @pgovind . This is a tricky dataset. I tried using LINQ and IO but ran into a similar issue so I understand the challenges. At least it's a way to see all the edge cases 🙂

@liugaocn
Copy link

Make sure only use '\t'. ',' does not work properly. If you have a string stored in a cell as "string1, string2" and you happen to use ',' as the separator, Microsoft.Data.Analysis will interpret it as two cells. This is why I have to check and convert all my comma-separated txts to tab-separated with Excel first.

@pgovind pgovind transferred this issue from dotnet/corefxlab Mar 6, 2021
@pgovind pgovind added the Microsoft.Data.Analysis All DataFrame related issues and PRs label Mar 6, 2021
@luisquintanilla
Copy link
Contributor Author

Related to #5647

@R0315
Copy link

R0315 commented Mar 15, 2023

I can't comment on #5647, but I'm seeing several folks say this issue is resolved as of a past version and want to share that it just happened to me last night.

I attempted to combine csvs that all have the same column headers using the below code and got the below error. I ended up having to use Pandas. But I would prefer to be able to immerse myself in c# more so I can get better using it. If I don't specify a separator, the data will load into a dataframe, but it will be misaligned in many places. Pandas handled it without issue.

If it's just me doing something incorrectly or not doing something I should be I would appreciate being pointed in the correct direction. It's difficult to find solid documentation on working with dataframes in c# like this.

EDIT: Updating my below code I can get it to read into a dataframe without issue. But I cannot save it to a csv without the issue persisting using DataFrame.SaveCsv(). I tried specifying a separator but it still results in misaligned data in the output file.

Context: using a Polyglot notebook
Windows 11
Microsoft.Data.Analysis version: 0.21.0-preview.22621.2
.NET Version: 7.0.201

//Building one data frame from a folder of CSV files
string folder_path = @"<path/to/folder>";
var files = new DirectoryInfo(folder_path).GetFiles("*.*",SearchOption.TopDirectoryOnly);

var df = DataFrame.LoadCsv(files[0].FullName);

foreach (var file in files.Skip(0))
{
    var tempDF = DataFrame.LoadCsv(file.FullName);
    df = df.Append(tempDF.Rows,inPlace:true);
}
DataFrame.SaveCsv(df,@"path\to\csv")
Error: System.FormatException: Line 5 has less columns than expected
at Microsoft.Data.Analysis.DataFrame.GuessKind(Int32 col, List`1 read)
at Microsoft.Data.Analysis.DataFrame.ReadCsvLinesIntoDataFrame(WrappedStreamReaderOrStringReader wrappedReader, Char separator, Boolean header, String[] columnNames, Type[] dataTypes, Int64 numberOfRowsToRead, Int32 guessRows, Boolean addIndexColumn)
at Microsoft.Data.Analysis.DataFrame.LoadCsv(Stream csvStream, Char separator, Boolean header, String[] columnNames, Type[] dataTypes, Int64 numberOfRowsToRead, Int32 guessRows, Boolean addIndexColumn, Encoding encoding)
at Microsoft.Data.Analysis.DataFrame.LoadCsv(String filename, Char separator, Boolean header, String[] columnNames, Type[] dataTypes, Int32 numRows, Int32 guessRows, Boolean addIndexColumn, Encoding encoding)
at Submission#6.<<Initialize>>d__0.MoveNext()
--- End of stack trace from previous location ---
at Microsoft.CodeAnalysis.Scripting.ScriptExecutionState.RunSubmissionsAsync[TResult](ImmutableArray`1 precedingExecutors, Func`2 currentExecutor, StrongBox`1 exceptionHolderOpt, Func`2 catchExceptionOpt, CancellationToken cancellationToken)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Microsoft.Data.Analysis All DataFrame related issues and PRs
Projects
None yet
Development

No branches or pull requests

4 participants