This month was mostly spent on removing global state from the regex engine, making re-entrancy less error-prone. The extract from the merge commit description below gives you all the details you could ever want.
Apart from that I spent a few hours re-enabling Copy-on-Write by default after the 5.18.0 release, plus a few other bits and pieces.
It turns out that I have finally used up all the hours on my grant plus extensions. I really must get round to applying for a new grant sometime soon!
commit 7d75537ea64f99b6b8b8049465b6254f5d16c693
Merge: 3a74e0e 28d03b2
Author: David Mitchell
AuthorDate: Sun Jun 2 20:59:58 2013 +0100
[MERGE] get rid of (most) regex engine global state
Historically, perl's regex engine was based on Henry Spencer's regex code, which was all contained within a single file and used a bunch of static variables to maintain the state of the current regex compile or execution.
This was perfectly adequate when only a single thread could execute a regex, and where the regex engine couldn't be called re-entrantly.
In 5.0, these vars were promoted to be full global vars as perl became embeddable; then in 5.5 they became part of the perl interpreter struct when MULTIPLICITY was introduced.
In 5.6, the Perl_save_re_context() function was introduced that did a whole bunch of SAVEPPTR type stuff, and was called in various places where it was possible that the engine may be re-entered, to avoid overwriting the global state of the currently executing regex. This was particularly important now that Unicode had been introduced, and certain character classes could trigger a call to the perl-level SWASH code, which could itself execute a regex; and where /(?{ ... })/ code blocks could be called which could do likewise.
In 5.10, the various PL_foo variables became fields within the new re_save_state struct, and a new interpreter var, PL_reg_state, was introduced which was of type struct re_save_state. Thus, all the individual vars were still global state, but it became easier to save them en masse in Perl_save_re_context() by just copying the re_save_state struct onto the save stack and marking it with the new SAVEt_RE_STATE type. Perl_save_re_context() was also expanded to be responsible for saving all the current $1 values.
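As a rough illustration of the "save the whole struct at once" idea, here is a minimal C sketch. It is not perl's actual code: the struct fields, the fixed-size save stack, and every toy_* name are hypothetical stand-ins, and the real save stack and SAVEt_RE_STATE handling are considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the grouped engine state; the field names
 * are illustrative and are not perl's real re_save_state fields. */
struct toy_re_state {
    const char *input;   /* string currently being matched */
    size_t      pos;     /* current match position */
    int         depth;   /* nesting depth of the match */
};

/* One global instance, playing the role of PL_reg_state. */
static struct toy_re_state toy_reg_state;

/* A tiny fixed-size stack, standing in for perl's save stack. */
#define TOY_SAVE_MAX 8
static struct toy_re_state toy_save_stack[TOY_SAVE_MAX];
static int toy_save_top = 0;

/* Save every field in one go with a single struct assignment. */
static void toy_save_re_context(void)
{
    assert(toy_save_top < TOY_SAVE_MAX);
    toy_save_stack[toy_save_top++] = toy_reg_state;
}

/* Restore the most recently saved copy, as leaving a scope would. */
static void toy_restore_re_context(void)
{
    assert(toy_save_top > 0);
    toy_reg_state = toy_save_stack[--toy_save_top];
}
```

A caller that might re-enter the engine would call toy_save_re_context() first and toy_restore_re_context() afterwards. Note that the state is still global, and the burden of remembering to save it still sits with the caller.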
Up until now, that is roughly how things have remained, except for bug fixes such as discovering more places where Perl_save_re_context() needs to be called.
Note that, philosophically speaking at least, this is broken in two ways. First, there's no good reason for the internal current state of the executing regex engine to be stored in a bunch of global vars; and second, we're relying on potential callers of the regex engine (like the magic tie code, for example) to be responsible for being aware that they might trigger re-entrancy in the regex engine, and thus to call Perl_save_re_context() as a precaution. This is error-prone and hard to prove correct. (As an example, Perl_save_re_context() is only called in the tie code if the tie code in question is doing a tied PRINT on STDERR; clearly an unusual use case that someone spotted was buggy at some point.)
The obvious fix, and the one performed by the series of commits in this merge, is to make all the global state local to the regex engine instead. Indeed, there is already a struct, regmatch_info, that is allocated as a local var in regexec(), then passed as an argument to the various lower-level functions called from regexec(). However, it only had limited use previously, so here we expand the number of functions where it is passed as an argument. In particular, it is now also created by re_intuit_start(), the other main run-time entry point to the regex engine.
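The general pattern described here, keeping per-match state in a local struct and passing a pointer to it down the call chain, can be sketched in C as follows. This is only a toy under stated assumptions: toy_match_info, toy_scan() and toy_match() are hypothetical names, and perl's real regmatch_info has different fields and far more responsibilities.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-match state, loosely in the spirit of regmatch_info. */
struct toy_match_info {
    const char *str;   /* string being scanned */
    size_t      pos;   /* current position within str */
};

/* Lower-level helper: it receives the state explicitly instead of
 * reading globals. Counts occurrences of c from info->pos up to
 * (but not including) index limit, advancing info->pos as it goes. */
static size_t toy_scan(struct toy_match_info *info, char c, size_t limit)
{
    size_t hits = 0;
    assert(info != NULL && info->str != NULL);
    while (info->pos < limit && info->str[info->pos] != '\0') {
        if (info->str[info->pos] == c)
            hits++;
        info->pos++;
    }
    return hits;
}

/* Entry point: the state lives on this call's own C stack frame. */
static size_t toy_match(const char *str, char c, const char *nested)
{
    struct toy_match_info info = { str, 0 };
    size_t hits = toy_scan(&info, c, 2);        /* scan the first two chars */

    if (nested != NULL)
        (void)toy_match(nested, c, NULL);       /* re-entrant call mid-match */

    hits += toy_scan(&info, c, (size_t)-1);     /* resume where we left off */
    return hits;
}
```

Because each call to toy_match() owns its own struct, the nested call cannot disturb the outer call's position, so toy_match("banana", 'a', "aaaa") gives the same count as toy_match("banana", 'a', NULL), and no caller has to save anything beforehand.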
However, there is a problem with this, in that various regex vars need cleaning up on croak (e.g. they point to a malloced buffer). Since the regmatch_info struct is just a local var on the C stack, it will be lost by the longjmp done by a croak() before leave_scope() can clear