[2.0.x] 6th-order jerk-controlled motion planning in real-time #10337
thinkyhead merged 2 commits into MarlinFirmware:bugfix-2.0.x from ejtagle:bugfix-2.0.x
Conversation
|
Pretty good stuff! I see that |
|
That method is unused and can be removed. I will not use it; it was a leftover from a previous implementation I did... (Y) |
|
So, only need to add a |
|
Marlin uses the term "Jerk" to mean "the maximum allowable instantaneous change in speed between movement segments". |
|
I've applied the name change and added the option to configs just under the other Jerk options. This can be merged for testing as soon as it passes Travis CI. |
|
👍 ... Yes, I am pretty interested in how it performs. At least here, the improvement in smoothness of movement is pretty amazing. I haven't tried yet, but I suspect this algorithm could also let the machine run faster... Combined with LINEAR_ADVANCE, speed improvements should be achievable :D ... |
|
What about JERK_CONTROL_BEZIER? There could be other methods in the future. |
|
Is this applicable for all frame types? |
|
@hg42: Mathematically speaking, this is mostly the ideal method. On 32-bit CPUs there is simply no point in degrading it. Maybe on 8-bit AVRs there could be degraded versions of this method (for example, S-curves) to make implementation feasible on such low-powered platforms.

@teemuatlut: Yes, it applies to all frame types. By decomposition of the forces and accelerations on the printing head/bed, you always end up controlling motor accelerations. That is the way Marlin handled them before this method, and that is the way it is still handled after this PR.

Please note: I am aware that for Deltas there are in fact two masses to consider: the actuator masses and the printing head. The movements (acceleration/velocity) of the actuators are directly controlled by the motors themselves, and the head's movement is the composition of those movements, but the acceleration of the head has a non-linear dependency on the acceleration of the motors and actuators. Even though the 6th-order implementation in this PR does not completely model that non-linear dependency, small movements can be treated as linear, so the 6th-order curve WILL improve movements even on Deltas.

In physics modeling there is always a tradeoff between exact and approximate models. An exact model requires precisely measured parameters and must still be computable in real time, and it also depends on how stable the mechanical structure of the machine is, as those parameters can vary over time! I think an approximate model is much easier to handle and evaluate, even if not completely optimal.

The idea of a 6th-order Bézier is simply to smooth out speed transitions and acceleration changes. That results in no sudden changes in the forces applied to the machine, and no sudden force changes means no excitation of the mechanically resonant frequencies of the machine's structure. No excitation of mechanical resonances means no vibration of the print head, and no vibration of the print head means more exact deposition of plastic onto the model. Hope you get the idea... 👍 |
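With both segment velocities held fixed, a 6th-order jerk-controlled profile reduces to the quintic "smoothstep" polynomial. A minimal host-side sketch (illustrative names, not Marlin's actual code) showing that acceleration and jerk both vanish at the junctions:

```cpp
#include <cassert>

// Quintic velocity profile: v(t) = v0 + (v1-v0)*(10t^3 - 15t^4 + 6t^5), t in [0,1].
// Illustrative sketch only; the real planner works in fixed point.
double bezier_speed(double v0, double v1, double t) {
  return v0 + (v1 - v0) * t * t * t * (10.0 + t * (-15.0 + 6.0 * t));
}

// First derivative of speed (acceleration): zero at t=0 and t=1.
double bezier_accel(double v0, double v1, double t) {
  return (v1 - v0) * t * t * (30.0 + t * (-60.0 + 30.0 * t));
}

// Second derivative of speed (jerk): zero at t=0 and t=1.
double bezier_jerk(double v0, double v1, double t) {
  return (v1 - v0) * t * (60.0 + t * (-180.0 + 120.0 * t));
}
```

The curve reaches v1 exactly at t = 1 while both derivatives vanish at the junctions, which is what prevents step changes in force between segments.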
|
Imagine head vibration represented by the movement of beer: see how much time it takes for the beer to stop moving after the movement of the print head (= glass) stops. They are using S-curves; we are improving even on that! Synthetos did a demo of their 6th-order jerk-controlled motion planner; just imagine the pendulum to be the print head. We should be getting exactly the same performance with our implementation 👍 |
Enable 6th-order jerk-controlled motion planning in real-time. Only for 32-bit MCUs. (AVR simply does not have enough processing power for this!)
If only we could simulate the oscillation-dampening effects of a foamy head. That would eliminate slosh in our printed objects. |
|
A foamy head damps vibration, but that means something is exciting those vibrations. The idea of this algorithm is to suppress that excitation (probably the force applied by the motors to the head to produce acceleration is the excitation). |
|
If I get it running, I will probably open a new PR with it 👍 |
|
@ejtagle: This is really exciting. It would definitely be interesting to see if an AVR could handle that. I guess it would be necessary to reduce the step rate as much as possible, but I think with a TMC2100 one could do 1/4 multistepping instead of 1/16 to reduce it a lot. |
|
On AVR it might be possible to segment the acceleration/deceleration portions of the move using a short table of fixed-point cosine factors. The result would be composed of shorter linear accelerations, which taken together would be generally S-shaped. |
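The idea above can be sketched roughly like this (the table values and function are hypothetical, just to illustrate the shape):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: approximate an S-curve by scaling a linear speed ramp
// with a short fixed-point table of (1 - cos(pi*i/8))/2 factors, scaled to 0..255.
static const uint8_t s_factor[9] = { 0, 10, 37, 79, 128, 176, 218, 245, 255 };

// Speed at segment boundary 'seg' (0..8) of an S-shaped ramp from v0 to v1.
uint32_t s_curve_speed(uint32_t v0, uint32_t v1, uint8_t seg) {
  return v0 + ((v1 - v0) * s_factor[seg] + 127) / 255;
}
```

Each of the 8 segments would still be a linear acceleration internally, but the segment endpoints trace a cosine-shaped S, so the applied force changes in small steps instead of one jump.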
|
I already have the Bézier interpolation running on AVR. It takes 150 cycles to evaluate each point of the Bézier curve. (I wrote an assembler version of the evaluator and reduced the precision of the coefficients as much as possible.) |
There's also fixed-point multiplication by the reciprocal, which is faster if the divisors are known ahead of time, i.e., constants. |
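The constant-divisor trick can be sketched as follows (hypothetical helpers; the usual caveat is that m = ceil(2^32/d) only yields an exact quotient while n·(d·m − 2^32) < 2^32):

```cpp
#include <cassert>
#include <cstdint>

// Precompute the Q32 reciprocal m = ceil(2^32 / d) once, at compile time
// when d is a constant.
constexpr uint64_t recip32(uint32_t d) {
  return ((1ull << 32) + d - 1) / d;
}

// Replace n / d with a multiply by the precomputed reciprocal.
// Exact floor(n/d) as long as n * (d*m - 2^32) < 2^32.
uint32_t div_by_recip(uint32_t n, uint64_t m) {
  return (uint32_t)(((uint64_t)n * m) >> 32);
}
```

This turns a runtime division into one 32×64-bit multiply and a shift, which is exactly why known divisors are so much cheaper.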
|
Yes, but in this case the fixed value is in the numerator. According to this paper https://www.microsoft.com/en-us/research/wp-content/uploads/2008/08/tr-2008-141.pdf , it is possible to get maximum precision using 2 iterations of Newton-Raphson and a 1 KB lookup table. 32-bit multiplies cost about 29 cycles, so I estimate the full 32-bit division will take about 4 × 29 ≈ 120 cycles. Way less than the default GCC implementation, which takes 400+ cycles to perform the exact same division with the same precision. |
|
I've got the Newton-Raphson approximation working. All that remains is to translate it to AVR assembler, and the PR for AVR will be ready...

```cpp
// This function computes (1 << 24) / d using Newton-Raphson approximations
uint32_t get_0x1000000_div_x(uint32_t d) {
  static const uint8_t inv_tab[256] = {
    255,253,252,250,248,246,244,242,240,238,236,234,233,231,229,227,
    225,224,222,220,218,217,215,213,212,210,208,207,205,203,202,200,
    199,197,195,194,192,191,189,188,186,185,183,182,180,179,178,176,
    175,173,172,170,169,168,166,165,164,162,161,160,158,157,156,154,
    153,152,151,149,148,147,146,144,143,142,141,139,138,137,136,135,
    134,132,131,130,129,128,127,126,125,123,122,121,120,119,118,117,
    116,115,114,113,112,111,110,109,108,107,106,105,104,103,102,101,
    100,99,98,97,96,95,94,93,92,91,90,89,88,88,87,86,
    85,84,83,82,81,80,80,79,78,77,76,75,74,74,73,72,
    71,70,70,69,68,67,66,66,65,64,63,62,62,61,60,59,
    59,58,57,56,56,55,54,53,53,52,51,50,50,49,48,48,
    47,46,46,45,44,43,43,42,41,41,40,39,39,38,37,37,
    36,35,35,34,33,33,32,32,31,30,30,29,28,28,27,27,
    26,25,25,24,24,23,22,22,21,21,20,19,19,18,18,17,
    17,16,15,15,14,14,13,13,12,12,11,10,10,9,9,8,
    8,7,7,6,6,5,5,4,4,3,3,2,2,1,0,0
  };
  if (!d) return uint32_t(-1L);

  // Compute an initial estimate of 0x1000000 / d
  // #1] Find the most significant bit set in the divisor
  uint8_t idx = 0;
  uint32_t nr = d;
  if (!(nr & 0xFF0000)) {
    nr <<= 8;
    idx += 8;
    if (!(nr & 0xFF0000)) {
      nr <<= 8;
      idx += 8;
      if (!(nr & 0xFF0000)) {
        nr <<= 8;
        idx += 8;
      }
    }
  }
  if (!(nr & 0xF00000)) {
    nr <<= 4;
    idx += 4;
  }
  if (!(nr & 0xC00000)) {
    nr <<= 2;
    idx += 2;
  }
  if (!(nr & 0x800000)) {
    nr <<= 1;
    idx++;
  }

  // #2] Isolate the top 9 bits of the normalized divisor, to be used as an
  //     index into the initial estimation table
  uint32_t tidx = nr >> 15,                      // top 9 bits; bit 8 is always set
           ie = 0x100 | inv_tab[tidx & 0xFF],    // table value; bit 9 is always set
           x = idx <= 8 ? (ie >> (8 - idx)) : (ie << (idx - 8)); // position the estimate

  // #3] Refine the estimate with Newton-Raphson. 2 iterations are enough
  x = uint32_t((x * uint64_t((1 << 25) - x * d)) >> 24);
  x = uint32_t((x * uint64_t((1 << 25) - x * d)) >> 24);

  // Estimate the quotient ...
  uint32_t q = x * d;
  // ... and the remainder
  uint32_t r = (1 << 24) - q;
  // Adjust the result if needed
  if (r >= d) x++;

  // x now holds the proper estimate
  return uint32_t(x);
}
``` |
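For anyone who wants to play with the idea on a desktop first, here is a hypothetical host-side sketch of the same Newton-Raphson scheme written without the 256-entry seed table: it trades the table for extra iterations, then fixes up the last few ULPs to an exact floor:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: floor((1 << 24) / d) for 2 <= d < (1 << 24), using
// Newton-Raphson on a crude power-of-two seed. The firmware version above
// needs only 2 iterations because its table seed is already ~9 bits accurate.
static uint32_t recip_q24(uint32_t d) {
  uint8_t msb = 31;
  while (!(d & (1ul << msb))) --msb;        // position of the highest set bit
  uint64_t x = 1ull << (23 - msb);          // underestimate of 2^24 / d
  for (int i = 0; i < 6; ++i)               // relative error squares each pass
    x = (x * ((1ull << 25) - x * d)) >> 24;
  while ((x + 1) * d <= (1ull << 24)) ++x;  // fix up the last few ULPs
  return uint32_t(x);
}
```

Because the seed always underestimates and truncation only rounds down, the iteration converges to the true quotient from below, so the final while loop only ever needs a few increments.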
|
Just to give you an idea, my instruction cycle budget is about 150 cycles maximum per ISR. The interpolator itself uses 150 cycles, and the coefficient calculation should also use 150 cycles maximum (they never run simultaneously). So once I translate this to assembler, it should be ready for testing. The interpolator itself is already done and working, but I also have to rewrite the _calc_bezier_curve_coeffs_avr function in assembler, and that last division is what I am going to replace with this Newton-Raphson equivalent. |
|
And yes, I prototype the algorithms in plain C, because I must be sure they work before translating them into assembler. Assembler is extremely difficult to follow and patch, so the transcription to ASM is done only once I am absolutely sure the algorithms work in each and every possible use case. |
I would be surprised if the assembler is much more optimized than what the C++ compiler can produce, given the utter simplicity of the operation. |
My old career (circa 1990) was programming 680x0 Assembler in Amiga DevPac, so I was trained to be crazy about shaving off cycles using clever techniques, counting cycles, picking the fastest op-codes, and so on. I got pretty good at optimization, wrote a nice optimized LZH decompressor, MFM disk encoder/decoder using the "blitter" co-processor, etc. Ah, those were the days… Now we can generally trust the C/C++ compilers to be smarter than most humans. Still, there are tricks the compiler might not think of… I suppose this has to run in the ISR, and there's no way we can pre-compute the curves as part of assembling the blocks in the planner…? |
|
How does M201 relate to the acceleration in this new planner? |
|
@teemuatlut: M201 accelerations currently limit the average acceleration; the peak acceleration can exceed it. We could limit the peak acceleration instead, but there are reasons for not doing that (and also reasons for doing it). Usually acceleration limits exist to avoid skipping steps. Steps are lost on sudden changes in direction, where the motor has to overcome the mass inertia of the printer's mechanism. With this algorithm there are no sudden changes in applied force, as there are no sudden changes in acceleration, so there are very good reasons to believe the maximum acceleration can be increased quite a bit compared to the old method without losing steps. That is why I chose to use the M201 acceleration as the mean acceleration. Due to the way the curve fitting works, the time to reach a specific distance is currently the same with either algorithm (trapezoidal or Bézier). The other reason for keeping it as it is: limiting peak acceleration requires even more calculations and would surely increase printing times a lot.

@thinkyhead: On any 32-bit or even 16-bit processor I would agree with you. GCC is pretty good at all kinds of optimizations and transformations to save cycles, but it lacks support for specific architectural optimizations that can sometimes improve execution time quite a bit for specific algorithms. For example, ARM Cortex-M3/M4 has multiply-accumulate (MAC) instructions that compute the product of two 32-bit numbers, produce a 64-bit result, and add it to a 64-bit accumulator, all in 5 instruction cycles on the M3 or just 1 cycle on the M4. GCC does not use that instruction, and so compiles that sequence into a load of plain 32-bit multiplications (1 cycle each on the M3, 2 cycles each on the M4) and 32-bit additions. The resulting code runs 5 to 10 times slower than using the MAC instruction. That is exactly the instruction we use to evaluate Béziers ;)

On AVR the situation is far worse: the processor has 8-bit registers and no barrel shifter, but ANSI C states that all types must be promoted to at least 16 bits before operating on them. Sometimes GCC realizes that the promotion is unnecessary and carries out the operation in a single 8-bit register, but often it does not. Examples:

```cpp
uint8_t v, a;
v &= ~a; // promotes the negation to 16 bits, promotes v to 16 bits, does the
         // operation, discards the top 8 bits, and then stores the result

uint32_t v1; uint8_t v2;
v2 = 0x14;
v1 |= v2; // v2 is PROMOTED to 32 bits (ouch!), then a 32-bit OR is performed on v1
```

All left and right shifts promote to 16 or 32 bits and are painfully done one bit at a time in a loop. The processor has a SWAP instruction that should allow 4-bit shifts, but GCC rarely uses it. Sometimes it is cheaper to rewrite a multi-bit shift in asm: where the loop version takes 3 iterations, the alternative takes 2 instruction cycles (SWAP + right shift). GCC does not use that optimization at all. GCC uses a kind of rule-based peephole optimizer and does not do the range analysis and proper decomposition into 8-bit operations. The main problem is that GCC was created with 32-bit archs in mind, so no effort went into the analysis passes required to reduce operations to 8-bit primitives, as 32-bit archs never required them. Thus, unless the expression is very simple, GCC will always produce suboptimal code sequences for AVR.

The other problem is the C standard: GCC can't express a 24-bit multiplication, as there is no 24-bit type in C. AVR has an 8-bit multiplier, and a 32-bit × 32-bit → 32-bit multiplication is resolved by GCC as ten 8×8-bit multiplies and about eight 32-bit adds, which gives an execution time of 10 + 4×8 = 42 cycles. If you need a 64-bit result from a 32×32-bit product, the only way to get it in ANSI C is to expand both operands to 64 bits. Even assuming GCC knows not to expand and solves the 32×32→64-bit product directly, AVR requires sixteen 8-bit multiplications and twelve 64-bit additions to compute the result: 16 + 12×8 = 112 cycles. In the Bézier algorithm I need about three 24×16-bit multiplies, keeping only the top 24 bits of each result: six 8-bit multiplies and four 24-bit adds, 6 + 3×4 = 18 cycles each. I also need four 16×16→16-bit multiplies, keeping only the top 16 bits: four 8-bit multiplies plus three 16-bit adds, 4 + 3×2 = 10 cycles each. And I need one extra 32×32→16-bit multiply (only the top 16 bits of the result): six 8-bit multiplies and six 16-bit adds, about 30 cycles. So the total for the Bézier evaluation is 30 + 18×3 + 10×4 = 124 cycles. Done in plain C, that would take roughly 112×3 + 3×40 ≈ 456 cycles, so the assembler version is about 3.5× faster. With the Newton-Raphson divide it is more or less the same story: the asm version ends up being 3× faster than the plain division generated by GCC. |
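For reference, the 32×32→64 multiply-accumulate pattern being discussed looks like this in C++ (whether the compiler actually emits UMULL/UMLAL for it depends on the GCC version and flags):

```cpp
#include <cassert>
#include <cstdint>

// 32x32 -> 64-bit multiply-accumulate: the pattern that maps to a single
// UMLAL instruction on ARM Cortex-M3/M4 when the compiler cooperates.
static inline uint64_t mac32(uint64_t acc, uint32_t a, uint32_t b) {
  return acc + (uint64_t)a * b;
}
```

A Horner-form polynomial evaluation is just a chain of such MACs, which is why a single-cycle MAC (on the M4) makes the Bézier evaluation so cheap.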
|
Yes, we could precompute the curve coefficients in the planner. And we could even, at the expense of a lot of RAM, precompute the curve values. The first is perfectly doable; the second would be a problem due to the well-known lack of RAM on this device. Placing extra work in the planner is something I don't like, because the planner uses an iterative approach that recomputes the speeds of movements every time a new movement is queued, leading to several redos of each calculation... I prefer to delay, as long as possible, any calculations the planner algorithm itself does not need. |
|
@ejtagle thanks for those extensive explanations. They answer the questions I asked in an email to your GitHub email address, so simply forget that email. I was worried about the much-increased maximum acceleration (maybe 2x-3x, right?). Btw, you have quite impressive knowledge... |
|
@thinkyhead: After all kinds of optimizations to the ASM version of the coefficient estimation routine, it takes 360 cycles to complete. That is double the time of the Bézier point evaluator, which takes 150 cycles. |
Good explanation of the compiler issues, thanks! I guess I shouldn't be surprised that GCC is leaving 8 bit architectures behind. I'd be interested to see your 360-cycle assembler version, just in case my "naive questions" might lead to a breakthrough that could cut it in half… by caching previous results or other such trickery. Meanwhile, coefficient estimation in the planner does seem the only viable way to go. |
|
You are right, caching could improve things. On 32-bit CPUs it makes no sense (the division takes about 10 cycles), but on AVR, maybe ;) Of course, @thinkyhead, all improvements to the routines/algorithms are always welcome! |
|
BTW, I managed to reduce the cycle count of the coefficient estimator to 210 cycles. It seems we can get a proper estimate using some small tables and just one Newton-Raphson iteration. |
|
Any updates on this? Pretty interesting. I believe this could be a large step up in quality, especially with the large vibration-prone beds currently in use (CR-10, etc.). EDIT: Found the pull request implementing it for AVR; will try it and report back, since I couldn't find much feedback on the actual printing results. |
|
It does what it says: nearly no ringing. But I suspect that the Linear Advance feature (if you use it) also needs to be updated to take into account the new acceleration profile this interpolation produces. I am thinking about that right now. |
|
This is relevant and interesting, what do you guys think? |
|
I'd agree, but that is not the main problem in 3D printing. The intent of the Bézier curves is not to increase printing speed but rather to reduce resonances. As a matter of fact, the printing time with the Bézier curve is exactly the same as without it, but with less vibration of the mechanical structure. |
|
A test without mass is totally nonsense. I once tested Klipper on an LPC1768 and could reach insane speeds and accelerations that were only limited by CPU power. 2000 mm/s and 20000 mm/s^2 were easy and still not the limit, with no obvious skipping. |
|
@hg42: Without trying to defend the article's author, what I get out of it is that the author is trying to maximize the speed increase as much as possible, reaching the motor's maximum possible speed as fast as it can be done, and it is true that torque decreases as speed increases; any motor has that limitation (http://lancet.mit.edu/motors/motors3.html). The same curves also apply to stepper motors. |
|
@ejtagle: By mentioning my tests, I actually wanted to support your statement that speed is not the problem. Btw, I did some tests with the new S-curve in Marlin on my CoreXY (X5S with lots of mechanical improvements to prevent vibrations, e.g. diagonal bars, fixed z-axis rods, etc.). With my low-current TMC2130 setup, I always had to limit acceleration to about 1000 mm/s^2 to prevent skips. So at least there are operating conditions where the S-curve doesn't make a difference. |
Only for 32-bit CPUs, as AVR simply does not have enough processing power for this.
As explained here (https://github.com/synthetos/TinyG/wiki/Jerk-Controlled-Motion-Explained), this PR modifies the Marlin planner so it plans movements using 6th-order jerk-controlled motion.
The idea is "simple": we still use the trapezoidal profile to "coalesce" movements, exactly the way it was done previously.
But in the stepper ISR, instead of estimating the time to the next step from the inverse of the speed, with the instantaneous speed a linear function of time and acceleration, the new algorithm fits a Bézier curve to that trapezoidal shape.
The Bézier curve guarantees that the first derivative of velocity (acceleration) starts and ends at 0, and that the second derivative of velocity (= jerk) also starts and ends at 0.
Then we evaluate the Bézier curve in real time (yay!! :D) and compute the instantaneous speed as a function of time. That speed is used to compute the time to the next ISR.
The result is almost incredible: there is no vibration of the machine itself while printing. You have to see it in action to believe it! :D
Now the hard part: to be able to evaluate it in real time, I had to convert the coefficients to fixed point (no problem there), and using specific instructions available on all ARM Cortex-M3/M4 CPUs, which perform 32×32→64-bit multiplications in just 5 cycles, I was able to evaluate the Bézier curve in just 40 clock cycles.
The bad news is that this evaluation is impossible to perform in real time on AVR (5 multiplications of 32×32 bits... probably more than 200 cycles... I really doubt AVR can do it. The generic code is there, just in case anyone wants to try it...).
To enable this new jerk-free planner, just define USE_JERK_CONTROL in Configuration.h, recompile, and try it. There are no other changes required.
There is a small caveat here: the new planner can temporarily exceed the maximum acceleration configured for the machine. That is not a problem, as the change in acceleration is extremely slow compared to the previous planner.
Well, as this change is very complex to evaluate, I ask people to try it and comment on their experiences, and I am open to suggestions here.
As a curiosity, the code is NOT the one used by Synthetos. They use a fixed-rate step ISR, which allows them to use a forward-differencing technique to evaluate the Bézier curve. We can't use that method, so there is no way around evaluating the Bézier with the full formula!
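Since t is arbitrary at each ISR, the evaluation has to be a full Horner chain over the fixed-point coefficients. A hypothetical Q24 sketch of that evaluation (illustrative, not the PR's actual code; it assumes arithmetic right shift of negative values, which holds for GCC on these targets):

```cpp
#include <cassert>
#include <cstdint>

// Evaluate v(t) = A*t^5 + B*t^4 + C*t^3 + v0 with A = 6d, B = -15d, C = 10d,
// d = v1 - v0, where t is a Q24 fraction (0 .. 1<<24) of the block duration.
// Each step is one of the 32x32 -> 64-bit multiplies discussed above.
int32_t eval_bezier_q24(int32_t v0, int32_t v1, uint32_t t) {
  const int64_t d = (int64_t)v1 - v0;
  int64_t acc = 6 * d;               // A
  acc = ((acc * t) >> 24) - 15 * d;  // A*t + B
  acc = ((acc * t) >> 24) + 10 * d;  // (A*t + B)*t + C
  acc = (acc * t) >> 24;             // ... * t^3, three more multiplies
  acc = (acc * t) >> 24;
  acc = (acc * t) >> 24;
  return v0 + (int32_t)acc;          // + F (= v0)
}
```

Five multiplies per evaluation, which at ~5 cycles each on a Cortex-M3 is consistent with the ~40-cycle figure quoted above.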
This PR could probably address FR #6193, at least for 32-bit CPUs.
On AVR, assuming 90 cycles per 32×32→64 multiplication and 8 such operations, evaluating the Bézier curve would take 720 cycles. That would, by itself, limit the step ISR rate to about 22 kHz... and the ISR has to do much more than that. Anyone willing to write an ASM version of the _calc_bezier_curve_coeffs function for AVR? (As a base we could use http://www.vfx.hu/avr/download/mult32.asm.) Right now there is generic code written in C that should work, but I don't know whether it will run in real time or not...