Reduce memory utilization on the Driver during the commit phase #16120

Merged
arhimondr merged 3 commits into prestodb:master from arhimondr:optimize-partition-update
May 20, 2021

@arhimondr
Member

In Presto, each writer produces a PartitionUpdate object that contains file names and other metadata for the files being written. This information is then collected on the driver to perform the final commit (file renames). In some cases this metadata can be quite large. This patch tries to optimize two things:

  • Reduce the PartitionUpdate memory footprint on the Driver by serializing to SMILE instead of JSON and applying ZSTD compression
  • Release serialized and compressed pages as soon as they are read by the engine on the driver. This should help avoid double memory utilization
== RELEASE NOTES ==

Presto on Spark Changes
* Reduce commit memory footprint on the Driver

@arhimondr
Member Author

This is all in addition to #16036, which should significantly reduce memory utilization on the driver, as the statistics pages no longer have to be buffered in the TableFinishOperator for the Presto on Spark use case (thanks @viczhang861 for optimizing it!).

Contributor

@viczhang861 viczhang861 left a comment

Try to make the title "Release inmemory input pages incrementally" better; inmemory is only used for PrestoSparkTaskInputs.

@arhimondr force-pushed the optimize-partition-update branch from a089538 to 07a07d0 on May 19, 2021 22:44
Contributor

extra "the" after "compress"

@arhimondr force-pushed the optimize-partition-update branch 2 times, most recently from d3c8757 to 5af3c26, on May 20, 2021 13:39
arhimondr added 3 commits May 20, 2021 10:13
To decrease memory pressure, pages from the inmemory input can be
released as soon as they are read by the Spark source operator.
@arhimondr force-pushed the optimize-partition-update branch from 5af3c26 to faa30da on May 20, 2021 14:13
Comment on lines +1182 to +1192
    try (ByteArrayOutputStream output = new ByteArrayOutputStream();
            ZstdOutputStreamNoFinalizer zstdOutput = new ZstdOutputStreamNoFinalizer(output)) {
        codec.writeBytes(zstdOutput, instance);
        zstdOutput.close();
        output.close();
        return output.toByteArray();
    }
    catch (IOException e) {
        throw new UncheckedIOException(e);
    }
}
Contributor

Suggested change
-    try (ByteArrayOutputStream output = new ByteArrayOutputStream();
-            ZstdOutputStreamNoFinalizer zstdOutput = new ZstdOutputStreamNoFinalizer(output)) {
-        codec.writeBytes(zstdOutput, instance);
-        zstdOutput.close();
-        output.close();
-        return output.toByteArray();
-    }
-    catch (IOException e) {
-        throw new UncheckedIOException(e);
-    }
-}
+    try (ByteArrayOutputStream output = new ByteArrayOutputStream()) {
+        try (ZstdOutputStreamNoFinalizer zstdOutput = new ZstdOutputStreamNoFinalizer(output)) {
+            codec.writeBytes(zstdOutput, instance);
+        }
+        return output.toByteArray();
+    }
+    catch (IOException e) {
+        throw new UncheckedIOException(e);
+    }
+}
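
For readers without the Presto codebase at hand, the nested try-with-resources shape in the suggestion can be sketched with standard-library classes only. This is an illustration under stated assumptions, not the code from the patch: GZIPOutputStream stands in for ZstdOutputStreamNoFinalizer, a raw byte array stands in for the codec output, and the class and method names (CompressSketch, compress, decompress) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper; GZIP is a stand-in for the ZSTD stream used in the patch.
final class CompressSketch
{
    private CompressSketch() {}

    static byte[] compress(byte[] payload)
    {
        try (ByteArrayOutputStream output = new ByteArrayOutputStream()) {
            // The inner block closes the compressing stream, which flushes
            // the remaining compressed data and the trailer into `output`.
            try (GZIPOutputStream compressed = new GZIPOutputStream(output)) {
                compressed.write(payload);
            }
            // Read the buffer only after the compressing stream is closed.
            return output.toByteArray();
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] decompress(byte[] compressed)
    {
        try (GZIPInputStream input = new GZIPInputStream(new ByteArrayInputStream(compressed));
                ByteArrayOutputStream output = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[4096];
            int read;
            // Copy until the decompressing stream reports end of data.
            while ((read = input.read(buffer)) != -1) {
                output.write(buffer, 0, read);
            }
            return output.toByteArray();
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

No explicit close() call is needed in either method; the point of the nesting is only that the compressing stream is fully closed before the buffer is read.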

Member Author

In theory it should be the same. Java guarantees that resources are closed in the reverse order of their declaration.

import java.io.Closeable;

public class Main
{
    private static class Closeable1
            implements Closeable
    {

        @Override
        public void close()
        {
            System.out.println("Close Closeable1");
        }
    }

    private static class Closeable2
            implements Closeable
    {

        @Override
        public void close()
        {
            System.out.println("Close Closeable2");
        }
    }

    public static void main(String[] args)
    {
        try (Closeable1 closeable1 = new Closeable1(); Closeable2 closeable2 = new Closeable2()) {
            System.out.println("Body");
        }
    }
}

Prints

Body
Close Closeable2
Close Closeable1

Contributor

The comment was more around not calling close() explicitly, since the try-with-resources does that.

Member Author

Oh, sorry, I misunderstood. Let me create a patch.

Member Author

Actually, I'm not sure it is correct to call output.toByteArray() before the ByteArrayOutputStream is closed. The implementation allows it, but I wonder whether that is the expected usage.
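
On this question, the JDK documentation for ByteArrayOutputStream states that closing the stream has no effect and that its methods can still be called after close(). A minimal check of that behavior, with a hypothetical class and method name, might look like this:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical demo class, not part of the patch.
final class ByteArrayCloseSketch
{
    private ByteArrayCloseSketch() {}

    // Returns true when toByteArray() yields the same bytes before and after close().
    static boolean sameBeforeAndAfterClose(byte[] data)
    {
        ByteArrayOutputStream output = new ByteArrayOutputStream();
        output.write(data, 0, data.length);
        byte[] before = output.toByteArray();
        try {
            // Documented as a no-op for ByteArrayOutputStream.
            output.close();
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] after = output.toByteArray();
        return Arrays.equals(before, data) && Arrays.equals(before, after);
    }
}
```

This only covers the in-memory buffer. A wrapping compression stream still has to be closed or finished before the buffer is read, otherwise the result may be truncated.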

return scala.reflect.ClassTag$.MODULE$.apply(clazz);
}

public static <T> Iterator<T> getNullifyingIterator(List<T> list)
Contributor

I guess you can't just call remove() on the iterator because that's a code change in Spark?

Member Author

In theory, remove() on the iterator of an ArrayList is an O(N) operation (because it has to shift the "tail"). In practice I don't think it is going to be an issue; I'm just staying on the safe side, in case a query produces a list with enough pages for this complexity to become a problem.
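
The body of getNullifyingIterator is not shown in this thread, so the following is only a sketch of what such an iterator could look like, assuming it hands out each element once and overwrites its slot with null instead of removing it. The class and method names are hypothetical.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch; the real getNullifyingIterator may differ.
final class NullifyingIteratorSketch
{
    private NullifyingIteratorSketch() {}

    static <T> Iterator<T> nullifyingIterator(List<T> list)
    {
        return new Iterator<T>()
        {
            private int index;

            @Override
            public boolean hasNext()
            {
                return index < list.size();
            }

            @Override
            public T next()
            {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                // set() returns the previous element and is O(1) for ArrayList;
                // the slot no longer references the element, so it can be
                // garbage collected once the consumer is done with it.
                return list.set(index++, null);
            }
        };
    }
}
```

The list keeps its size and is left holding nulls, so this shape only fits callers that discard the list after iterating. For a linked list, set(index, null) would itself be linear, so the constant-time argument applies to array-backed lists.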

@arhimondr merged commit 4bd5dc5 into prestodb:master on May 20, 2021
@arhimondr deleted the optimize-partition-update branch on May 20, 2021 15:53
@sujay-jain mentioned this pull request on May 21, 2021
10 tasks