@@ -895,6 +895,8 @@ void waitForAckedSeqno(long seqno) throws IOException {
try (TraceScope ignored = dfsClient.getTracer().
newScope("waitForAckedSeqno")) {
LOG.debug("{} waiting for ack for: {}", this, seqno);
int dnodes = nodes != null ? nodes.length : 3;
int writeTimeout = dfsClient.getDatanodeWriteTimeout(dnodes);
Contributor
This timeout is very long. For a 3-node pipeline, it will be 8 minutes + 3 * 5 seconds (for the extension).

I'm not sure I have a better suggestion for the timeout.

One question - I believe we saw this problem in a hung HiveServer 2 (HS2) process. Do we know how this problem causes the entire HS2 instance to get hung? I would have thought this issue would block the closing of a single file on HDFS, while other files open within the same client could still progress as normal.
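To make the arithmetic above concrete, here is a minimal sketch of how the effective write timeout scales with pipeline length. The constant names and values mirror the HDFS defaults implied by the comment (an 8-minute base plus a 5-second extension per datanode), but they are assumptions for illustration, not a copy of the actual `DFSClient.getDatanodeWriteTimeout` implementation:

```java
// Sketch only: constants assumed from the "8 minutes + 3 * 5 seconds" figure
// quoted in the review comment, not taken from the HDFS source.
public class WriteTimeoutSketch {
    static final long WRITE_TIMEOUT_MS = 8 * 60 * 1000L;      // assumed 8-minute base
    static final long WRITE_TIMEOUT_EXTENSION_MS = 5 * 1000L; // assumed 5 s per datanode

    // Effective timeout grows linearly with the number of pipeline nodes.
    static long datanodeWriteTimeout(int numNodes) {
        return WRITE_TIMEOUT_MS + WRITE_TIMEOUT_EXTENSION_MS * numNodes;
    }

    public static void main(String[] args) {
        // For a 3-node pipeline: 480,000 ms + 3 * 5,000 ms = 495,000 ms
        System.out.println(datanodeWriteTimeout(3));
    }
}
```

Under these assumed constants, a 3-node pipeline waits 495,000 ms (8 min 15 s) before the new check in the diff below fires, which is why the reviewer calls the timeout "very long".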

Contributor Author
Not much troubleshooting was done in Hive, @sodonnel. It looks like the whole HS2 instance was hung; it didn't accept any new connections.

Contributor
Thanks - I guess we can go ahead with this change, but even with it, the HS2 may well hang for 8+ minutes. It's hard to know for sure without understanding why this problem caused the whole instance to hang.

long begin = Time.monotonicNow();
try {
synchronized (dataQueue) {
@@ -905,6 +907,16 @@ void waitForAckedSeqno(long seqno) throws IOException {
}
try {
dataQueue.wait(1000); // when we receive an ack, we notify on
long duration = Time.monotonicNow() - begin;
if (duration > writeTimeout) {
LOG.error("No ack received, took {}ms (threshold={}ms). "
+ "File being written: {}, block: {}, "
+ "Write pipeline datanodes: {}.",
duration, writeTimeout, src, block, nodes);
throw new InterruptedIOException("No ack received after " +
duration / 1000 + "s and a timeout of " +
writeTimeout / 1000 + "s");
}
// dataQueue
} catch (InterruptedException ie) {
throw new InterruptedIOException(