
It keeps all processed lines in memory, making it useless for processing big files #286

Closed
rkmax opened this issue Oct 19, 2018 · 5 comments
rkmax commented Oct 19, 2018

I'm processing big files: more than 800 MB and 3.7M lines, and some 10x bigger than that.

doing

import csv from 'csvtojson';
// ...
async function processFunc(row) {
  // ...
}
return csv(options)
  .fromFile(filepath)
  .subscribe(processFunc);

This works fine, but I'm noticing that internally the library keeps two things for every parsed line: the raw string of the line and also some kind of parsed object.

By looking at the code of the library I found (I don't know if it's the right place) that there's a Result.needEmitAll getter, based on some calculations, that evaluates to true or false.

Could you please explain how I can force this to be false?

[screenshot: memory usage before the change]

When I force it to false (by editing the library's source in node_modules), the change over the same number of iterations is notable:

[screenshot: memory usage after forcing needEmitAll to false]

Keyang (Owner) commented Oct 20, 2018

Hi,
Weird, I never had this issue.
I have written a test script for this:

var csv=require("csvtojson");
console.log("pid",process.pid);
csv().fromFile("./1.csv")
.subscribe(function(data){
  return new Promise(function(resolve,reject){
    setTimeout(function(){
      resolve();
    },0);
  })
})
.on("done",function(){
  console.log(process.memoryUsage());
})

The 1.csv is around 1.2 GB, containing 15M lines of CSV.
Memory stays below 75 MB.

Result.needEmitAll is based on whether .then is called, so if you have something like below:

var csv=require("csvtojson");
console.log("pid",process.pid);
csv().fromFile("./1.csv")
.subscribe(function(data){
  return new Promise(function(resolve,reject){
    setTimeout(function(){
      resolve();
    },0);
  })
})
.then(function(){ // this implicitly tells the parser that the user wants all JSON objects in a final array
  console.log(process.memoryUsage());
})

This will keep the whole result in memory.

Please let me know if this is what you have experienced.

Thanks.
Keyang

rkmax commented Oct 20, 2018

In my case I need to call .then because I need to wait for the file to finish processing before doing more things. Maybe it would be great to pass this as an option: csv({needEmitAll: false})
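One interim workaround is a sketch like the following (assuming the stream's "done" event from the script above also passes an error on failure; processFunc, filepath, and options are from my snippet): wrap the conversion in a Promise manually, so .then is never called on the parser itself.

var csv = require("csvtojson");

function processFile(filepath, options) {
  return new Promise(function (resolve, reject) {
    csv(options)
      .fromFile(filepath)
      .subscribe(processFunc)      // rows are handled as they stream
      .on("done", function (err) { // fires once parsing has finished
        if (err) reject(err);
        else resolve();            // safe to do the follow-up work now
      });
  });
}

Then processFile(filepath, options).then(...) waits for completion without the parser ever buffering the full array.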

Keyang self-assigned this Oct 21, 2018
Keyang added this to the 2.0.9 milestone Oct 22, 2018
Keyang (Owner) commented Jun 26, 2019

Added parameter "needEmitAll".
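A usage sketch (my assumptions: the option is passed like any other parser option, as suggested above, and .then still resolves on completion, just without the accumulated array):

var csv = require("csvtojson");

csv({ needEmitAll: false })  // opt out of keeping every parsed row
  .fromFile("./1.csv")
  .subscribe(function (row) {
    // handle each row as it streams
  })
  .then(function () {
    // parsing finished; the full result array was not retained
    console.log(process.memoryUsage());
  });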

Keyang closed this as completed Jun 26, 2019
avimar commented Sep 15, 2019

Thanks, I think I may have run into this same issue.

stevenmasci commented
Hi all, thanks for documenting this issue.

I'm having a related memory problem. I'm not sure if it's due to the library or how I've structured my code, but I'm hoping someone might be able to point me in the right direction.

I require the complete JSON result to be built, as I need to do further processing afterwards, so depending on the CSV file I could be consuming a lot of memory.

const content = await csv().fromString(csvString)

The problem I'm having is that once the function finishes executing or goes out of scope, the garbage collector does not seem to (eventually) collect the array of JSON objects, which is what I'd expect to happen. Even forcing a collection with global.gc() doesn't work.

Is there a possibility that something in the csvtojson library is maintaining references to elements in the JSON array, preventing collection?
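For reference, a diagnostic sketch using only standard Node APIs (run with node --expose-gc; csvString and the function name diagnose are stand-ins for your own code) to check whether the array is really retained after the last reference is dropped:

// run with: node --expose-gc diagnose.js
var csv = require("csvtojson");

async function diagnose(csvString) {
  var content = await csv().fromString(csvString);
  console.log("after parse:", process.memoryUsage().heapUsed);
  content = null; // drop our only reference to the parsed array
  global.gc();    // force a collection (requires --expose-gc)
  console.log("after gc:", process.memoryUsage().heapUsed);
  // if heapUsed stays high here, something is still referencing the rows
}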
