Welcome to the second part of the “PHP’s Source Code For PHP Developers” series.
In the previous part ircmaxell explained where you can find the PHP source code and how it is basically structured, and also gave a small introduction to C (as that’s the language PHP is written in). If you missed that post, you probably should read it before starting with this one.
What we’ll cover in this article is locating the definitions of internal functions in the PHP codebase, as well as understanding them.
How to find function definitions
For a start, let’s try to find out how the strpos function is defined.
The first thing to try is to go to the PHP 5.4 source code root and type strpos into the search box at the top of the page. The result will be a huge listing of strpos occurrences in the PHP source code.
As this doesn’t really help us much, we use a little trick: Instead of searching for just strpos, we search for "PHP_FUNCTION strpos" (don’t forget the quotes, they are important).
Now we are left with only two entries:
/PHP_5_4/ext/standard/
php_string.h 48 PHP_FUNCTION(strpos);
string.c 1789 PHP_FUNCTION(strpos)
The first thing to notice is that both occurrences are in the ext/standard folder. This is exactly where one would expect to find them, as the strpos function (together with pretty much all other string, array and file functions) is part of the standard extension.
Now open both links in new tabs and see what code hides behind them.
You’ll find that the first link leads you to the php_string.h file, which is full of code looking like this:
// ...
PHP_FUNCTION(strpos);
PHP_FUNCTION(stripos);
PHP_FUNCTION(strrpos);
PHP_FUNCTION(strripos);
PHP_FUNCTION(strrchr);
PHP_FUNCTION(substr);
// ...
This is exactly what a typical header file (a file ending in .h) looks like: a plain list of functions that are defined elsewhere. We aren’t really interested in this, as we already know what we’re looking for.
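If the split between the two search results seems arbitrary, a minimal sketch in plain C may help. MY_FUNCTION is a hypothetical stand-in I made up for illustration; the real PHP_FUNCTION macro produces a different name and parameter list, so only the declaration/definition pattern carries over.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for PHP_FUNCTION: it only builds a prefixed
 * function name with a fixed signature. The real macro differs. */
#define MY_FUNCTION(name) long my_fn_##name(const char *arg)

/* What a header (.h) would contain: a declaration, ending in a semicolon. */
MY_FUNCTION(length);

/* What a source file (.c) would contain: the definition, with a body. */
MY_FUNCTION(length)
{
    assert(arg != NULL);
    return (long) strlen(arg);
}
```

The entry with the trailing semicolon only announces that the function exists; the entry without it is where the code lives.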
The second link is much more interesting: It leads to the string.c file, which contains the actual source code of the function.
Before I walk you through the code step by step, I’d recommend that you try to understand the function by yourself. It’s a really simple function and most things should be clear even if you don’t know the exact details.
The skeleton of a PHP function
All PHP functions share the same basic structure. At the top there are a few variable declarations, then there is a zend_parse_parameters call, then comes the main logic, with RETURN_*** and php_error_docref calls intermixed.
So, let’s start with the variable declarations:
zval *needle;
char *haystack;
char *found = NULL;
char needle_char[2];
long offset = 0;
int haystack_len;
The first line declares needle as a pointer to a zval. A zval is PHP’s internal representation of an arbitrary PHP value. What exactly it looks like will be the subject of the next post.
The second line declares haystack as a pointer to a character. At this point you’ll have to remember that in C, arrays are represented by pointers to their first element. I.e. haystack will point to the first character of the $haystack string you passed in. Then haystack + 1 will point to the second character, haystack + 2 to the third, and so on. So one could read the whole string by repeatedly incrementing the pointer by one.
The problem arising here is that PHP has to know where the string ends. Otherwise it would keep incrementing the pointer without ever stopping. In order to deal with this, PHP also stores an explicit length, here in the haystack_len variable.
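To make the pointer-plus-length idea concrete, here is a small sketch in plain C. The function count_byte is a hypothetical helper of mine, not something from the PHP source; it just walks a buffer the way described above and stops based on the explicit length.

```c
#include <assert.h>
#include <stddef.h>

/* Counts how often byte `c` occurs in the first `len` bytes of `str`.
 * The explicit length, not a terminating NUL byte, decides where to stop. */
size_t count_byte(const char *str, size_t len, char c)
{
    size_t count = 0;
    const char *end;

    assert(str != NULL);
    end = str + len;                 /* points one past the last byte */
    for (const char *p = str; p < end; p++) {
        if (*p == c) {
            count++;
        }
    }
    return count;
}
```

Because the loop is bounded by the length, a call like count_byte("a\0a", 3, 'a') also looks at the byte after the embedded NUL, which a function relying on zero-termination would never reach.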
The last declaration of interest to us at this point is the offset variable, which will be used to store the third parameter of the function: the offset to start searching at. It is declared as a long, which is an integer datatype, just like int. The difference between those two is not of importance here, but you should know that PHP integers are stored in longs and string lengths are stored in ints.
Now let’s look at the next three lines:
if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "sz|l", &haystack, &haystack_len, &needle, &offset) == FAILURE) {
    return;
}
What these lines basically do is take the parameters that were passed to the function and put them into the variables declared above.
The first argument to the function is the number of arguments passed. This number is provided by the ZEND_NUM_ARGS() macro.
The next argument is the TSRMLS_CC macro, which is kind of an idiosyncrasy of PHP. You’ll find this strange macro scattered across pretty much the whole PHP code base. It is part of the Thread Safe Resource Manager (TSRM), which ensures that PHP doesn’t mix up variables between multiple threads. This is unimportant to us, so whenever you see TSRMLS_CC (or TSRMLS_DC) in the code, just ignore it. (One oddity you might have noticed is that there is no comma before this “argument”. That is because the macro evaluates either to nothing or to , tsrm_ls depending on whether or not you are using a thread-safe build. So basically the comma is part of the macro.)
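The comma trick is easier to see in isolation. The following sketch uses hypothetical macros (MY_THREAD_SAFE, MY_CTX_D, MY_CTX_C) that I invented to mirror the pattern; they are not the actual TSRM definitions.

```c
#include <stddef.h>

/* Set this to 0 to simulate a non-thread-safe build. */
#define MY_THREAD_SAFE 1

#if MY_THREAD_SAFE
# define MY_CTX_D , void *ctx   /* appended to parameter lists */
# define MY_CTX_C , ctx         /* appended to argument lists  */
#else
# define MY_CTX_D               /* expands to nothing */
# define MY_CTX_C               /* expands to nothing */
#endif

/* The extra parameter only exists when MY_THREAD_SAFE is 1. */
int add_one(int value MY_CTX_D)
{
    return value + 1;
}

/* The context is handed on without writing a comma by hand. */
int add_two(int value MY_CTX_D)
{
    return add_one(add_one(value MY_CTX_C) MY_CTX_C);
}
```

With MY_THREAD_SAFE set to 0 both macros vanish and the functions take a single argument, which is why the comma cannot be written in the source itself.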
Now comes the important stuff: The "sz|l" string specifies which parameters the function accepts:
s // first parameter is a *s*tring
z // second parameter is a *z*val (an arbitrary value)
| // the following parameters (here just one) are optional
l // third parameter is a *l*ong (an integer)
There are more type specifiers than s, z and l, but most should be clear from the character. For example b is a boolean, d is a double (floating point number), a is an array, f is a callback (function) and o is an object.
The remaining arguments &haystack, &haystack_len, &needle, &offset specify the variables to put the arguments into. As you can see, they are all prefixed with the address-of operator (&), which means that not the variables themselves are passed, but pointers to them.
After this call haystack will contain the haystack string, haystack_len the length of that string, needle the needle value and offset the starting offset.
Additionally, the return value of the call is checked for FAILURE (which happens if you try to pass invalid arguments to the function, e.g. an array to a string parameter). In this case zend_parse_parameters will emit a warning and the function simply returns (which will eventually return null to the userland PHP code).
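If the combination of a format string and a list of pointers feels like magic, here is a toy version of the idea. Everything in it (toy_value, toy_parse, the 0 / -1 return codes) is made up for illustration, and it is much stricter than the real thing: it only knows s and l and does no type conversion at all.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

enum toy_type { TOY_LONG, TOY_STRING };

/* Toy stand-in for a PHP value: either a long or a string with length. */
struct toy_value {
    enum toy_type type;
    long lval;
    const char *str;
    size_t len;
};

/* Walks `spec` and stores each argument through the pointers passed as
 * variadic arguments. Returns 0 on success and -1 on failure. */
int toy_parse(const struct toy_value *args, int num_args, const char *spec, ...)
{
    va_list out;
    int i = 0, optional = 0, result = 0;

    assert(spec != NULL);
    va_start(out, spec);
    for (; *spec != '\0' && result == 0; spec++) {
        if (*spec == '|') {           /* everything after this is optional */
            optional = 1;
            continue;
        }
        if (i >= num_args) {          /* ran out of arguments */
            if (!optional) {
                result = -1;
            }
            break;
        }
        if (*spec == 's' && args[i].type == TOY_STRING) {
            const char **str = va_arg(out, const char **);
            size_t *len = va_arg(out, size_t *);
            *str = args[i].str;       /* "s" fills two variables */
            *len = args[i].len;
        } else if (*spec == 'l' && args[i].type == TOY_LONG) {
            long *lval = va_arg(out, long *);
            *lval = args[i].lval;
        } else {
            result = -1;              /* wrong type or unknown specifier */
        }
        i++;
    }
    va_end(out);
    if (result == 0 && i < num_args) {
        result = -1;                  /* more arguments than the spec allows */
    }
    return result;
}
```

Note how an optional argument that was not passed leaves its variable untouched, which is why offset is initialized to 0 in the declarations of strpos.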
So after the parameters are parsed, the main function body starts:
if (offset < 0 || offset > haystack_len) {
    php_error_docref(NULL TSRMLS_CC, E_WARNING, "Offset not contained in string");
    RETURN_FALSE;
}
What this code does is pretty obvious. If the offset is out of bounds, an E_WARNING level error is thrown through php_error_docref and then false is returned using the RETURN_FALSE macro.
php_error_docref is the error function you’ll mainly find in extensions (i.e. the ext folder). The name comes from the fact that it emits a reference to the documentation in the error message (you know, the one that never works…). Additionally there is the zend_error function, which is mainly used by the Zend Engine, but also occurs in extension code from time to time.
Both functions use sprintf-like formatting, thus error messages can contain placeholders, which are then filled using the following arguments. Here is an example:
php_error_docref(NULL TSRMLS_CC, E_WARNING, "Failed to write %d bytes to %s", Z_STRLEN_PP(tmp), filename);
// %d is filled with Z_STRLEN_PP(tmp)
// %s is filled with filename
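For the curious, this is roughly how a function can accept a format string followed by any number of arguments in C. toy_warning is a hypothetical helper that writes into a buffer instead of emitting a real PHP error; it only shows the variadic mechanism, not how php_error_docref is implemented.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Formats a message into `buf` the way a printf-style error function
 * would, and returns the length of the formatted message. */
int toy_warning(char *buf, size_t size, const char *format, ...)
{
    va_list args;
    int written;

    assert(buf != NULL && format != NULL);
    va_start(args, format);            /* arguments after `format` */
    written = vsnprintf(buf, size, format, args);
    va_end(args);
    return written;
}
```

Calling toy_warning(buf, sizeof buf, "Failed to write %d bytes to %s", 12, "out.txt") fills the placeholders from left to right, just like in the example above.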
Let’s proceed in the code:
if (Z_TYPE_P(needle) == IS_STRING) {
    if (!Z_STRLEN_P(needle)) {
        php_error_docref(NULL TSRMLS_CC, E_WARNING, "Empty delimiter");
        RETURN_FALSE;
    }
    found = php_memnstr(haystack + offset,
                        Z_STRVAL_P(needle),
                        Z_STRLEN_P(needle),
                        haystack + haystack_len);
}
The first five lines should be clear: This branch is only executed if the needle is a string and an error is thrown if it is empty. Then comes the interesting part: php_memnstr is called, which is the function doing the main work. As always you can click on the function name to see its source code.
php_memnstr returns a pointer to the first occurrence of the needle in the haystack (that’s why the found variable is declared as char *, i.e. a pointer to character). From this the offset can be easily computed by subtracting the two pointers, as can be seen at the end of the function:
RETURN_LONG(found - haystack);
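The pointer subtraction is worth trying out once in plain C. The sketch below uses toy_memnstr, a deliberately naive search I wrote for this purpose (the real php_memnstr is implemented differently), and toy_strpos, which returns -1 where PHP would return false.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Naive search in the spirit of php_memnstr: returns a pointer to the
 * first occurrence of `needle` in [haystack, end), or NULL. */
const char *toy_memnstr(const char *haystack, const char *needle,
                        size_t needle_len, const char *end)
{
    const char *last;

    assert(haystack != NULL && needle != NULL && haystack <= end);
    if (needle_len == 0 || (size_t)(end - haystack) < needle_len) {
        return NULL;
    }
    last = end - needle_len;          /* last position where a match fits */
    for (const char *p = haystack; p <= last; p++) {
        if (memcmp(p, needle, needle_len) == 0) {
            return p;
        }
    }
    return NULL;
}

/* Returns the position of the match relative to the start of the
 * haystack, or -1 if there is no match. */
long toy_strpos(const char *haystack, size_t haystack_len,
                const char *needle, size_t needle_len, size_t offset)
{
    const char *found;

    if (offset > haystack_len) {
        return -1;
    }
    found = toy_memnstr(haystack + offset, needle, needle_len,
                        haystack + haystack_len);
    /* Subtracting two pointers into the same buffer gives their distance. */
    return found ? (long)(found - haystack) : -1;
}
```

The search starts at haystack + offset, but the subtraction uses haystack itself, so the result is always relative to the beginning of the string.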
Finally, let’s look at the branch which is taken when the needle is not a string:
else {
    if (php_needle_char(needle, needle_char TSRMLS_CC) != SUCCESS) {
        RETURN_FALSE;
    }
    needle_char[1] = 0;
    found = php_memnstr(haystack + offset,
                        needle_char,
                        1,
                        haystack + haystack_len);
}
I’ll just quote what this does from the manual: “If needle is not a string, it is converted to an integer and applied as the ordinal value of a character.” This basically means that instead of writing strpos($str, 'A') you could also write strpos($str, 65), because the ordinal value of A is 65.
If you look up at the variable declarations, you’ll see that needle_char is declared as char needle_char[2], i.e. an array with room for two characters. php_needle_char will put the actual character (in our example the A) into needle_char[0]. Then the strpos code will set needle_char[1] to 0.
The reason behind this is that in C, strings are zero-terminated, i.e. the last character is set to NUL (the character with the ordinal value 0). In the context of PHP this doesn’t make much sense, as PHP stores an explicit length for all strings (so it does not need zero-termination to find the end of a string), but this still is done in order to ensure compatibility with the C functions used internally by PHP.
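Here is a small sketch of why the terminating NUL matters as soon as a standard C function gets involved. Both helpers are hypothetical and assume an ASCII-compatible character set; toy_needle_char is not the real php_needle_char, which works on a zval.

```c
#include <assert.h>
#include <string.h>

/* Turns an ordinal value into a one-character, zero-terminated C string
 * stored in `out`, which must have room for two chars. */
void toy_needle_char(long ordinal, char out[2])
{
    assert(out != NULL);
    out[0] = (char) ordinal;   /* the character itself, e.g. 65 -> 'A' */
    out[1] = '\0';             /* terminating NUL for C string functions */
}

/* Finds the character with the given ordinal value using strstr(), which
 * relies on the terminating NUL byte of both of its arguments. */
long toy_find_ordinal(const char *haystack, long ordinal)
{
    char needle_char[2];
    const char *found;

    assert(haystack != NULL);
    toy_needle_char(ordinal, needle_char);
    found = strstr(haystack, needle_char);
    return found ? (long)(found - haystack) : -1;
}
```

Without the second array element set to NUL, strstr would keep reading past needle_char[0] into whatever happens to follow it in memory.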
Zend functions
I’m getting tired of strpos, so let’s try to find another function: strlen. We’ll do this using our usual approach:
Starting from the PHP 5.4 source code root try to search for strlen.
You’ll see lots of unrelated uses of the function, so instead search for "PHP_FUNCTION strlen". While doing so, you’ll notice something strange though: There won’t be any results.
The reason is that strlen is one of the few functions that are not defined by an extension, but by the Zend Engine itself. In such cases the function is not defined as PHP_FUNCTION(strlen), but as ZEND_FUNCTION(strlen). Thus we have to search for "ZEND_FUNCTION strlen" instead.
As we already know, we have to click on the entry without a semicolon ; at the end to get to the source code. This leads us to the following definition in Zend/zend_builtin_functions.c:
ZEND_FUNCTION(strlen)
{
    char *s1;
    int s1_len;
    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s", &s1, &s1_len) == FAILURE) {
        return;
    }
    RETVAL_LONG(s1_len);
}
I don’t think that I have to further comment on this, as the function is so simple.
Methods
We’ll cover how classes and objects work in more detail in a different post, but as a small peek ahead: You can search for class methods by typing ClassName::methodName into the search. As an example, try to search for SplFixedArray::getSize.
In the next part
The next part will again be published on ircmaxell’s blog. It will cover what zvals are, how they work and how they are used in the source code (all those Z_*** macros…)
Source: http://nikic.github.com/2012/03/16/Understanding-PHPs-internal-function-definitions.html