freak2code is the blog about the latest geek news, software reviews, trends in technology and more informative stuff.

Monday, 13 August 2012

PHP: a fractal of bad design



Preface

I’m cranky. I complain about a lot of things. There’s a lot in the world of technology I don’t like, and that’s really to be expected—programming is a hilariously young discipline, and none of us have the slightest clue what we’re doing. Combine with Sturgeon’s Law, and I have a lifetime’s worth of stuff to gripe about.
This is not the same. PHP is not merely awkward to use, or ill-suited for what I want, or suboptimal, or against my religion. I can tell you all manner of good things about languages I avoid, and all manner of bad things about languages I enjoy. Go on, ask! It makes for interesting conversation.
PHP is the lone exception. Virtually every feature in PHP is broken somehow. The language, the framework, the ecosystem, are all just bad. And I can’t even point out any single damning thing, because the damage is so systemic. Every time I try to compile a list of PHP gripes, I get stuck in this depth-first search discovering more and more appalling trivia. (Hence, fractal.)
PHP is an embarrassment, a blight upon my craft. It’s so broken, but so lauded by every empowered amateur who’s yet to learn anything else, as to be maddening. It has paltry few redeeming qualities and I would prefer to forget it exists at all.
But I’ve got to get this out of my system. So here goes, one last try.

An analogy

I just blurted this out to Mel to explain my frustration and she insisted that I reproduce it here.
I can’t even say what’s wrong with PHP, because— okay. Imagine you have uh, a toolbox. A set of tools. Looks okay, standard stuff in there.
You pull out a screwdriver, and you see it’s one of those weird tri-headed things. Okay, well, that’s not very useful to you, but you guess it comes in handy sometimes.
You pull out the hammer, but to your dismay, it has the claw part on both sides. Still serviceable though, I mean, you can hit nails with the middle of the head holding it sideways.
You pull out the pliers, but they don’t have those serrated surfaces; it’s flat and smooth. That’s less useful, but it still turns bolts well enough, so whatever.
And on you go. Everything in the box is kind of weird and quirky, but maybe not enough to make it completely worthless. And there’s no clear problem with the set as a whole; it still has all the tools.
Now imagine you meet millions of carpenters using this toolbox who tell you “well hey what’s the problem with these tools? They’re all I’ve ever used and they work fine!” And the carpenters show you the houses they’ve built, where every room is a pentagon and the roof is upside-down. And you knock on the front door and it just collapses inwards and they all yell at you for breaking their door.
That’s what’s wrong with PHP.

Stance

I assert that the following qualities are important for making a language productive and useful, and PHP violates them with wild abandon. If you can’t agree that these are crucial, well, I can’t imagine how we’ll ever agree on much.
  • A language must be predictable. It’s a medium for expressing human ideas and having a computer execute them, so it’s critical that a human’s understanding of a program actually be correct.
  • A language must be consistent. Similar things should look similar, different things different. Knowing part of the language should aid in learning and understanding the rest.
  • A language must be concise. New languages exist to reduce the boilerplate inherent in old languages. (We could all write machine code.) A language must thus strive to avoid introducing new boilerplate of its own.
  • A language must be reliable. Languages are tools for solving problems; they should minimize any new problems they introduce. Any “gotchas” are massive distractions.
  • A language must be debuggable. When something goes wrong, the programmer has to fix it, and we need all the help we can get.
My position is thus:
  • PHP is full of surprises: mysql_real_escape_string, E_ALL
  • PHP is inconsistent: strpos, str_rot13
  • PHP requires boilerplate: error-checking around C API calls, ===
  • PHP is flaky: ==, foreach ($foo as &$bar)
  • PHP is opaque: no stack traces by default or for fatals, complex error reporting
I can’t provide a paragraph of commentary for every single issue explaining why it falls into these categories, or this would be endless. I trust the reader to, like, think.

Don’t comment with these things

I’ve been in PHP arguments a lot. I hear a lot of very generic counter-arguments that are really only designed to halt the conversation immediately. Don’t pull these on me, please. :(
  • Do not tell me that “good developers can write good code in any language”, or bad developers blah blah. That doesn’t mean anything. A good carpenter can drive in a nail with either a rock or a hammer, but how many carpenters do you see bashing stuff with rocks? Part of what makes a good developer is the ability to choose the tools that work best.
  • Do not tell me that it’s the developer’s responsibility to memorize a thousand strange exceptions and surprising behaviors. Yes, this is necessary in any system, because computers suck. That doesn’t mean there’s no upper limit for how much zaniness is acceptable in a system. PHP is nothing but exceptions, and it is not okay when wrestling the language takes more effort than actually writing your program. My tools should not create net positive work for me to do.
  • Do not tell me “that’s how the C API works”. What on Earth is the point of using a high-level language if all it provides are some string helpers and a ton of verbatim C wrappers? Just write C! Here, there’s even a CGI library for it.
  • Do not tell me “that’s what you get for doing weird things”. If two features exist, someday, someone will find a reason to use them together. And again, this isn’t C; there’s no spec, there’s no need for “undefined behavior”.
  • Do not tell me that Facebook and Wikipedia are built in PHP. I’m aware! They could also be written in Brainfuck, but as long as there are smart enough people wrangling the things, they can overcome problems with the platform. For all we know, development time could be halved or doubled if these products were written in some other language; this data point alone means nothing.
  • Ideally, don’t tell me anything! This is my one big shot; if this list doesn’t hurt your opinion of PHP, nothing ever will, so stop arguing with some dude on the Internet and go make a cool website in record time to prove me wrong :)
Side observation: I loooove Python. I will also happily talk your ear off complaining about it, if you really want me to. I don’t claim it’s perfect; I’ve just weighed its benefits against its problems and concluded it’s the best fit for things I want to do.
And I have never met a PHP developer who can do the same with PHP. But I’ve bumped into plenty who are quick to apologize for anything and everything PHP does. That mindset is terrifying.

PHP

Core language

CPAN has been called the “standard library of Perl”. That doesn’t say much about Perl’s standard library, but it makes the point that a solid core can build great things.

Philosophy

  • PHP was originally designed explicitly for non-programmers (and, reading between the lines, non-programs); it has not well escaped its roots. A choice quote from the PHP 2.0 documentation, regarding + and friends doing type conversion:
    Once you start having separate operators for each type you start making the language much more complex. ie. you can’t use ‘==’ for stings [sic], you now would use ‘eq’. I don’t see the point, especially for something like PHP where most of the scripts will be rather simple and in most cases written by non-programmers who want a language with a basic logical syntax that doesn’t have too high a learning curve.
  • PHP is built to keep chugging along at all costs. When faced with either doing something nonsensical or aborting with an error, it will do something nonsensical. Anything is better than nothing.
  • There’s no clear design philosophy. Early PHP was inspired by Perl; the huge stdlib with “out” params is from C; the OO parts are designed like C++ and Java.
  • PHP takes vast amounts of inspiration from other languages, yet still manages to be incomprehensible to anyone who knows those languages. (int) looks like C, but int doesn’t exist. Namespaces use \. The new array syntax results in [key => value], unique among every language with hash literals.
  • Weak typing (i.e., silent automatic conversion between strings/numbers/et al) is so complex that whatever minor programmer effort is saved is by no means worth it.
  • Little new functionality is implemented as new syntax; most of it is done with functions or things that look like functions. Except for class support, which deserved a slew of new operators and keywords.
  • Some of the problems listed on this page do have first-party solutions—if you’re willing to pay Zend for fixes to their open-source programming language.
  • There is a whole lot of action at a distance. Consider this code, taken from the PHP docs somewhere.
      @fopen('http://example.com/not-existing-file', 'r');
    
    What will it do?
    • If PHP was compiled with --disable-url-fopen-wrapper, it won’t work. (Docs don’t say what “won’t work” means; returns null, throws exception?) Note that this flag was removed in PHP 5.2.5.
    • If allow_url_fopen is disabled in php.ini, this still won’t work. (How? No idea.)
    • Because of the @, the warning about the non-existent file won’t be printed.
    • But it will be printed if scream.enabled is set in php.ini.
    • Or if scream.enabled is set manually with ini_set.
    • But not if the right error_reporting level isn’t set.
    • If it is printed, exactly where it goes depends on display_errors, again in php.ini. Or ini_set.
    I can’t tell how this innocuous function call will behave without consulting compile-time flags, server-wide configuration, and configuration done in my program. And this is all built in behavior.
  • The language is full of global and implicit state. mbstring uses a global character set. func_get_arg and friends look like regular functions, but operate on the currently-executing function. Error/exception handling have global defaults. register_tick_function sets a global function to run every tick—what?!
  • There is no threading support whatsoever. (Not surprising, given the above.) Combined with the lack of built-in fork (mentioned below), this makes parallel programming extremely difficult.
  • Parts of PHP are practically designed to produce buggy code.
    • json_decode returns null for invalid input, even though null is also a perfectly valid object for JSON to decode to—this function is completely unreliable unless you also call json_last_error every time you use it.
    • array_search, strpos, and similar functions return 0 if they find the needle at position zero, but false if they don’t find it at all.
    Let me expand on that last part a bit.
    In C, functions like strpos return -1 if the item isn’t found. If you don’t check for that case and try to use that as an index, you’ll hit junk memory and your program will blow up. (Probably. It’s C. Who the fuck knows. I’m sure there are tools for this, at least.)
    In, say, Python, the equivalent .index methods will raise an exception if the item isn’t found. If you don’t check for that case, your program will blow up.
    In PHP, these functions return false. If you use FALSE as an index, or do much of anything with it except compare with ===, PHP will silently convert it to 0 for you. Your program will not blow up; it will, instead, do the wrong thing with no warning, unless you remember to include the right boilerplate around every place you use strpos and certain other functions.
    This is bad! Programming languages are tools; they’re supposed to work with me. Here, PHP has actively created a subtle trap for me to fall into, and I have to be vigilant even with such mundane things as string operations and equality comparison. PHP is a minefield.
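To make that boilerplate concrete, here is a minimal sketch of the checks both functions need. The helpers find_offset and decode_json are made-up names for illustration, not part of PHP; only strpos, json_decode, and json_last_error are real.

```php
<?php
// Hypothetical helper: wraps strpos() so "not found" cannot be mistaken for index 0.
function find_offset($haystack, $needle)
{
    $pos = strpos($haystack, $needle);
    // Strict comparison is required here: 0 == false is true, 0 === false is not.
    return $pos === false ? null : $pos;
}

// Hypothetical helper: reports decode failure separately from a decoded null.
function decode_json($text)
{
    $value = json_decode($text, true);
    $ok = json_last_error() === JSON_ERROR_NONE;
    return array('ok' => $ok, 'value' => $value);
}

$at_start = find_offset('hello', 'h');   // 0: found at the first character
$missing  = find_offset('hello', 'z');   // null: not found

$valid_null = decode_json('null');       // ok is true, value is null
$broken     = decode_json('{oops');      // ok is false, value is also null
```

Both calls to decode_json produce the same null value; only the extra json_last_error check tells them apart.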
I have heard a great many stories about the PHP interpreter and its developers from a great many places. These are from people who have worked on the PHP core, debugged PHP core, interacted with core developers. Not a single tale has been a compliment.
So I have to fit this in here, because it bears repeating: PHP is a community of amateurs. Very few people designing it, working on it, or writing code in it seem to know what they’re doing. (Oh, dear reader, you are of course a rare exception!) Those who do grow a clue tend to drift away to other platforms, reducing the average competence of the whole. This, right here, is the biggest problem with PHP: it is absolutely the blind leading the blind.
Okay, back to facts.

Operators

  • == is useless.
    • It’s not transitive. "foo" == TRUE, and "foo" == 0… but, of course, TRUE != 0.
    • == converts to numbers when possible (123 == "123foo"… although "123" != "123foo"), which means it converts to floats when possible. So large hex strings (like, say, password hashes) may occasionally compare true when they’re not. Even JavaScript doesn’t do this.
    • For the same reason, "6" == " 6", "4.2" == "4.20", and "133" == "0133". But note that 133 != 0133, because 0133 is octal. But "0x10" == "16" and "1e3" == "1000"!
    • === compares values and type… except with objects, where === is only true if both operands are actually the same object! For objects, == compares both value (of every attribute) and type, which is what === does for every other type. What.
  • Comparison isn’t much better.
    • It’s not even consistent: NULL < -1, and NULL == 0. Sorting is thus nondeterministic; it depends on the order in which the sort algorithm happens to compare elements.
    • The comparison operators try to sort arrays, two different ways: first by length, then by elements. If they have the same number of elements but different sets of keys, though, they are uncomparable.
    • Objects compare as greater than anything else… except other objects, which they are neither less than nor greater than.
    • For a more type-safe ==, we have ===. For a more type-safe <, we have… nothing. "123" < "0124", always, no matter what you do. Casting doesn’t help, either.
  • Despite the craziness above, and the explicit rejection of Perl’s pairs of string and numeric operators, PHP does not overload +. + is always addition, and . is always concatenation.
  • The [] indexing operator can also be spelled {}.
  • [] can be used on any variable, not just strings and arrays. It returns null and issues no warning.
  • [] cannot slice; it only retrieves individual elements.
  • foo()[0] is a syntax error. (Fixed in PHP 5.4.)
  • Unlike (literally!) every other language with a similar operator, ?: is left associative. So this:
      $arg = 'T';
      $vehicle = ( ( $arg == 'B' ) ? 'bus' :
                   ( $arg == 'A' ) ? 'airplane' :
                   ( $arg == 'T' ) ? 'train' :
                   ( $arg == 'C' ) ? 'car' :
                   ( $arg == 'H' ) ? 'horse' :
                   'feet' );
      echo $vehicle;
    
    prints horse.
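A rough sketch of two of the points above. Several of the listed comparisons give different results depending on the PHP version, so this sticks to plain numeric strings. Newer PHP versions may also reject the unparenthesized ternary chain outright, so the left-associative grouping is spelled out by hand here. The variable names are just for illustration.

```php
<?php
// Numeric-looking strings are compared as numbers by ==, but as strings by ===.
$loose_exponent = ("1e3" == "1000");    // true: both sides are read as 1000
$loose_zero_pad = ("133" == "0133");    // true: both sides are read as 133
$loose_decimal  = ("4.2" == "4.20");    // true: both sides are read as 4.2
$strict         = ("1e3" === "1000");   // false: the strings differ

$arg = 'T';
$c1 = ($arg == 'B');
$c2 = ($arg == 'A');
$c3 = ($arg == 'T');
$c4 = ($arg == 'C');
$c5 = ($arg == 'H');

// Left-associative grouping, i.e. how the chain above is actually parsed.
$left = (((($c1 ? 'bus' : $c2) ? 'airplane' : $c3) ? 'train' : $c4) ? 'car' : $c5) ? 'horse' : 'feet';

// Right-associative grouping, i.e. what the chain was meant to say.
$right = $c1 ? 'bus' : ($c2 ? 'airplane' : ($c3 ? 'train' : ($c4 ? 'car' : ($c5 ? 'horse' : 'feet'))));

echo $left, "\n";    // horse
echo $right, "\n";   // train
```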

Variables

  • There is no way to declare a variable. Variables that don’t exist are created with a null value when first used.
  • Global variables need a global declaration before they can be used. This is a natural consequence of the above, so it would be perfectly reasonable, except that globals can’t even be read without an explicit declaration—PHP will quietly create a local with the same name, instead. I’m not aware of another language with similar scoping issues.
  • There are no references. What PHP calls references are really aliases; there’s nothing that’s a step back, like Perl’s references, and there’s no pass-by-object identity like in Python.
  • “Referenceness” infects a variable unlike anything else in the language. PHP is dynamically-typed, so variables generally have no type… except references, which adorn function definitions, variable syntax, and assignment. Once a variable is made a reference (which can happen anywhere), it’s stuck as a reference. There’s no obvious way to detect this and un-referencing requires nuking the variable entirely.
  • Okay, I lied. There are “SPL types” which also infect variables: $x = new SplBool(true); $x = "foo"; will fail. This is like static typing, you see.
  • A reference can be taken to a key that doesn’t exist within an undefined variable (which becomes an array). Using a non-existent array normally issues a notice, but this does not.
  • Constants are defined by a function call taking a string; before that, they don’t exist. (This may actually be a copy of Perl’s use constant behavior.)
  • Variable names are case-sensitive. Function and class names are not. This includes method names, which makes camelCase a strange choice for naming.
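A small sketch of how sticky a reference is, using the foreach ($foo as &$bar) pattern mentioned earlier, plus the case rule from the last bullet. The names $items, $fixed, and shout are invented for the example.

```php
<?php
$items = array(1, 2, 3);

// After this loop, $item is still an alias of the last element of $items.
foreach ($items as &$item) {
}

// Reusing $item by value now writes through that alias on every iteration.
foreach ($items as $item) {
}
// $items is now array(1, 2, 2)

$fixed = array(1, 2, 3);
foreach ($fixed as &$entry) {
}
unset($entry);   // the only way to drop the alias is to destroy the variable
foreach ($fixed as $entry) {
}
// $fixed is still array(1, 2, 3)

// Function names ignore case; variable names do not.
function shout($text)
{
    return strtoupper($text);
}
$same_function = SHOUT('hi');   // calls shout()
```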

Constructs

  • array() and a few dozen similar constructs are not functions. array on its own means nothing, $func = "array"; $func(); doesn’t work.
  • Array unpacking can be done with the list($a, $b) = ... operation. list() is function-like syntax just like array. I don’t know why this wasn’t given real dedicated syntax, or why the name is so obviously confusing.
  • (int) is obviously designed to look like C, but it’s a single token; there’s nothing called int in the language. Try it: not only does var_dump(int) not work, it throws a parse error because the argument looks like the cast operator.
  • (integer) is a synonym for (int). There’s also (bool)/(boolean) and (float)/(double)/(real).
  • There’s an (array) operator for casting to array and an (object) for casting to object. That sounds nuts, but there’s almost a use: you can use (array) to have a function argument that’s either a single item or a list, and treat it identically. Except you can’t do that reliably, because if someone passes a single object, casting it to an array will actually produce an array containing that object’s attributes. (Casting to object performs the reverse operation.)
  • include() and friends are basically C’s #include: they dump another source file into yours. There is no module system, even for PHP code.
  • There’s no such thing as a nested or locally-scoped function or class. They’re only global. Including a file dumps its variables into the current function’s scope (and gives the file access to your variables), but dumps functions and classes into global scope.
  • Appending to an array is done with $foo[] = $bar.
  • echo is a statement-y kind of thing, not a function.
  • empty($var) is so extremely not-a-function that anything but a variable, e.g. empty($var || $var2), is a parse error. Why on Earth does the parser need to know about empty?
  • There’s redundant syntax for blocks: if (...): ... endif;, etc.
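The (array) trick and its failure mode, as a minimal sketch. Point and count_items are hypothetical names made up for this example.

```php
<?php
// Hypothetical class with two public attributes.
class Point
{
    public $x = 1;
    public $y = 2;
}

// Hypothetical function that wants to accept "one item or a list of items".
function count_items($arg)
{
    $items = (array) $arg;
    return count($items);
}

$single_string = count_items('one');                  // 1: wrapped into a one-element array
$real_list     = count_items(array('a', 'b', 'c'));   // 3: arrays pass through unchanged
$single_object = count_items(new Point());            // 2: the object's attributes, not one item
```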

Error handling

  • PHP’s one unique operator is @ (actually borrowed from DOS), which silences errors.
  • PHP errors don’t provide stack traces. You have to install a handler to generate them. (But you can’t for fatal errors—see below.)
  • PHP parse errors generally just spew the parse state and nothing more, making a forgotten quote terrible to debug.
  • PHP’s parser refers to e.g. :: internally as T_PAAMAYIM_NEKUDOTAYIM, and the << operator as T_SL. I say “internally”, but as above, this is what’s shown to the programmer when :: or << appears in the wrong place.
  • Most error handling is in the form of printing a line to a server log nobody reads and carrying on.
  • E_STRICT is a thing, but it doesn’t seem to actually prevent much and there’s no documentation on what it actually does.
  • E_ALL includes all error categories—except E_STRICT. (Fixed in 5.4.)
  • Weirdly inconsistent about what’s allowed and what isn’t. I don’t know how E_STRICT applies here, but these things are okay:
    • Trying to access a non-existent object property, i.e., $foo->x. (warning)
    • Using a variable as a function name, or variable name, or class name. (silent)
    • Trying to use an undefined constant. (notice)
    • Trying to access a property of something that isn’t an object. (notice)
    • Trying to use a variable name that doesn’t exist. (notice)
    • 2 < "foo" (silent)
    • foreach (2 as $foo); (warning)
    And these things are not:
    • Trying to access a non-existent class constant, i.e., $foo::x. (fatal error)
    • Using a constant string as a function name, or variable name, or class name. (parse error)
    • Trying to call an undefined function. (fatal error)
    • Leaving off a semicolon on the last statement in a block or file. (parse error)
    • Using list and various other quasi-builtins as method names. (parse error)
    • Subscripting the return value of a function, i.e., foo()[0]. (parse error; okay in 5.4, see above)
    There are a good few examples of other weird parse errors elsewhere in this list.
  • The __toString method can’t throw exceptions. If you try, PHP will… er, throw an exception. (Actually a fatal error, which would be passable, except…)
  • PHP errors and PHP exceptions are completely different beasts. They don’t seem to interact at all.
    • PHP errors (internal ones, and calls to trigger_error) cannot be caught with try/catch.
    • Likewise, exceptions do not trigger error handlers installed by set_error_handler.
    • Instead, there’s a separate set_exception_handler which handles uncaught exceptions, because wrapping your program’s entry point in a try block is impossible in the mod_php model.
    • Fatal errors (e.g., new ClassDoesntExist()) can’t be caught by anything. A lot of fairly innocuous things throw fatal errors, forcibly ending your program for questionable reasons. Shutdown functions still run, but they can’t get a stack trace (they run at top-level), and they can’t easily tell if the program exited due to an error or running to completion.
  • There is no finally construct, making wrapper code (set handler, run code, unset handler; monkeypatch, run a test, unmonkeypatch) tedious and difficult to write. Despite that OO and exceptions were largely copied from Java, this is deliberate, because finally “doesn’t make much sense in the context of PHP”. Huh?
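Since errors and exceptions live in separate worlds, bridging them takes a handler you install yourself. A minimal sketch, assuming you want non-fatal errors turned into exceptions; throw_error_exception is a made-up name, while set_error_handler, restore_error_handler, trigger_error, and ErrorException are PHP built-ins. This does nothing for fatal errors.

```php
<?php
// Hypothetical handler: rethrows a PHP error as an ErrorException.
function throw_error_exception($severity, $message, $file, $line)
{
    throw new ErrorException($message, 0, $severity, $file, $line);
}

set_error_handler('throw_error_exception');

$caught = null;
try {
    // Without the handler, this would only emit a warning and carry on.
    trigger_error('something went wrong', E_USER_WARNING);
} catch (ErrorException $e) {
    $caught = $e->getMessage();
}

// Manual cleanup of the global handler.
restore_error_handler();
```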

Functions

  • Function calls are apparently rather expensive.
  • Some built-in functions interact with reference-returning functions in, er, a strange way.
  • As mentioned elsewhere, a lot of things that look like functions or look like they should be functions are actually language constructs, so nothing that works with functions will work with them.
  • Function arguments can have “type hints”, which are basically just static typing. But you can’t require that an argument be an int or string or object or other “core” type, even though every builtin function uses this kind of typing, probably because int is not a thing in PHP. (See above about (int).) You also can’t use the special pseudo-type decorations used heavily by builtin functions: mixed, number, or callback. (callable is allowed as of PHP 5.4.)
    • As a result, this:
        function foo(string $s) {}
      
        foo("hello world");
      
      produces the error:
        PHP Catchable fatal error:  Argument 1 passed to foo() must be an instance of string, string given, called in...
      
    • You may notice that the “type hint” given doesn’t actually have to exist; there is no string class in this program. If you try to use ReflectionParameter::getClass() to examine the type hint dynamically, then it will balk that the class doesn’t exist, making it impossible to actually retrieve the class name.
    • A function’s return value can’t be hinted.
  • Passing the current function’s arguments to another function (dispatch, not uncommon) is done by call_user_func_array('other_function', func_get_args()). But func_get_args throws a fatal error at runtime, complaining that it can’t be a function parameter. How and why is this even a type of error? (Fixed in PHP 5.3.)
  • Closures require explicitly naming every variable to be closed-over. Why can’t the interpreter figure this out? Kind of hamstrings the whole feature. (Okay, it’s because using a variable ever, at all, creates it unless explicitly told otherwise.)
  • Closed-over variables are “passed” by the same semantics as other function arguments. That is, arrays and strings etc. will be “passed” to the closure by value. Unless you use &.
  • Because closed-over variables are effectively automatically-passed arguments and there are no nested scopes, a closure can’t refer to private methods, even if it’s defined inside a class. (Possibly fixed in 5.4? Unclear.)
  • No named arguments to functions. Actually explicitly rejected by the devs because it “makes for messier code”.
  • Function arguments with defaults can appear before function arguments without, even though the documentation points out that this is both weird and useless. (So why allow it?)
  • Extra arguments to a function are ignored (except with builtin functions, which raise an error). Missing arguments are assumed null.
  • “Variadic” functions require faffing about with func_num_args, func_get_arg, and func_get_args. There’s no syntax for such a thing.
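A short sketch of the closure and variadic points above, assuming a PHP version with closures (5.3 or later). sum_all, $by_value, and $by_ref are illustrative names only.

```php
<?php
// "Variadic" function: no parameters declared, arguments fetched by function call.
function sum_all()
{
    return array_sum(func_get_args());
}

$total = 0;

// Each closed-over variable must be listed; by default it is copied.
$by_value = function ($n) use ($total) {
    $total += $n;        // changes the closure's private copy only
    return $total;
};

// With &, the closure shares the outer variable.
$by_ref = function ($n) use (&$total) {
    $total += $n;
    return $total;
};

$by_value(5);
$after_value = $total;   // still 0

$by_ref(5);
$after_ref = $total;     // now 5

$sum = sum_all(1, 2, 3); // 6
```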

OO

  • The procedural parts of PHP are designed like C, but the objectional (ho ho) parts are designed like Java. I cannot overemphasize how jarring this is. The class system is designed around the lower-level Java language which is naturally and deliberately more limited than PHP’s contemporaries, and I am baffled.
    • I’ve yet to find a global function that even has a capital letter in its name, yet important built-in classes use camelCase method names and have getFoo Java-style accessors.
    • Perl, Python, and Ruby all have some concept of “property” access via code; PHP has only the clunky __get and friends. (The documentation inexplicably refers to such special methods as “overloading”.)
    • Classes have something like variable declaration (var and const) for class attributes, whereas the procedural part of the language does not.
    • Despite the heavy influence from C++/Java, where objects are fairly opaque, PHP often treats objects like fancy hashes—for example, the default behavior of foreach ($obj as $key => $value) is to iterate over every accessible attribute of the object.
  • Classes are not objects. Any metaprogramming has to refer to them by string name, just like functions.
  • Built-in types are not objects and (unlike Perl) can in no way be made to look like objects.
  • instanceof is an operator, despite that classes were a late addition and most of the language is built on functions and function-ish syntax. Java influence? Classes not first-class? (I don’t know if they are.)
    • But there is an is_a function. With an optional argument specifying whether to allow the object to actually be a string naming a class.
    • get_class is a function; there’s no typeof operator. Likewise is_subclass_of.
    • This doesn’t work on builtin types, though (again, int is not a thing). For that, you need is_int etc.
    • Also the right-hand side has to be a variable or literal string; it can’t be an expression. That causes… a parse error.
  • clone is an operator?!
  • Object attributes are $obj->foo, but class attributes are Class::$foo. ($obj::$foo will try to stringify $obj and use it as a class name.) Class attributes can’t be accessed via objects; the namespaces are completely separate, making class attributes completely useless for polymorphism. Class methods, of course, are exempt from this rule and can be called like any other method. (I am told C++ also does this. C++ is not a good example of fine OO.)
  • Also, an instance method can still be called statically (Class::method()). If done so from another method, this is treated like a regular method call on the current $this. I think.
  • new, private, public, protected, static, etc. Trying to win over Java developers? I’m aware this is more personal taste, but I don’t know why this stuff is necessary in a dynamic language; in C++ most of it’s about compilation and compile-time name resolution.
  • PHP has first-class support for “abstract classes”, which are classes that cannot be instantiated. Code in similar languages achieves this by throwing an exception in the constructor.
  • Subclasses cannot override private methods. Subclass overrides of public methods can’t even see, let alone call, the superclass’s private methods. Problematic for, say, test mocks.
  • Methods cannot be named e.g. “list”, because list() is special syntax (not a function) and the parser gets confused. There’s no reason this should be ambiguous, and monkeypatching the class works fine. ($foo->list() is not a syntax error.)
  • If an exception is thrown while evaluating a constructor’s arguments (e.g., new Foo(bar()) and bar() throws), the constructor won’t be called, but the destructor will be. (This is fixed in PHP 5.3.)
  • Exceptions in __autoload and destructors cause fatal errors. (Fixed in PHP 5.3.6. So now a destructor might throw an exception literally anywhere, since it’s called the moment the refcount drops to zero. Hmm.)
  • There are no constructors or destructors. __construct is an initializer, like Python’s __init__. There is no method you can call on a class to allocate memory and create an object.
  • There is no default initializer. Calling parent::__construct() if the superclass doesn’t define its own __construct is a fatal error.
  • OO brings with it an iterator interface that parts of the language (e.g., for...as) respect, but nothing built-in (like arrays) actually implements the interface. If you want an array iterator, you have to wrap it in an ArrayIterator. There are no built-in ways to chain or slice or otherwise work with iterators as first-class objects.
  • Interfaces like Iterator reserve a good few unprefixed method names. If you want your class to be iterable (without the default behavior of iterating all of its attributes), but want to use a common method name like key or next or current, well, too bad.
  • Classes can overload how they convert to strings and how they act when called, but not how they convert to numbers or any other builtin type.
  • Strings, numbers, and arrays all have a string conversion; the language relies heavily on this. Functions and classes are strings. Yet trying to convert a built-in or user-defined object (even a Closure) to a string causes an error if it doesn’t define __toString. Even echo becomes potentially error-prone.
  • There is no overloading for equality or ordering.
  • Static variables inside instance methods are global; they share the same value across all instances of the class.
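The last bullet in a few lines, as a sketch; Ticket and next_number are invented names.

```php
<?php
// Hypothetical class: the static variable belongs to the method, not to each object.
class Ticket
{
    public function next_number()
    {
        static $n = 0;   // initialized once, then shared by every instance
        return ++$n;
    }
}

$a = new Ticket();
$b = new Ticket();

$first  = $a->next_number();   // 1
$second = $b->next_number();   // 2, even though $b is a different object
$third  = $a->next_number();   // 3
```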

Standard library

Perl is “some assembly required”. Python is “batteries included”. PHP is “kitchen sink, but it’s from Canada and both faucets are labeled C”.

General

  • There is no module system. You can compile PHP extensions, but which ones are loaded is specified by php.ini, and your options are for an extension to exist (and inject its contents into your global namespace) or not.
  • As namespaces are a recent feature, the standard library isn’t broken up at all. There are thousands of functions in the global namespace.
  • Chunks of the library are wildly inconsistent from one another.
    • Underscore versus not: strpos/str_rot13, php_uname/phpversion, base64_encode/urlencode, gettype/get_class
    • “to” versus 2: ascii2ebcdic, bin2hex, deg2rad, strtolower, strtotime
    • Object+verb versus verb+object: base64_decode, str_shuffle, var_dump versus create_function, recode_string
    • Argument order: array_filter($input, $callback) versus array_map($callback, $input), strpos($haystack, $needle) versus array_search($needle, $haystack)
    • Prefix confusion: usleep versus microtime
    • Case insensitive functions vary on where the i goes in the name.
    • About half the array functions actually start with array_. The others do not.
    • htmlentities and html_entity_decode are inverses of each other, with completely different naming conventions.
  • Kitchen sink. The library includes:
    • Bindings to ImageMagick, bindings to GraphicsMagick (which is a fork of ImageMagick), and a handful of functions for inspecting EXIF data (which ImageMagick can already do).
    • Functions for parsing bbcode, a very specific kind of markup used by a handful of particular forum packages.
    • Way too many XML packages. DOM (OO), DOM XML (not), libxml, SimpleXML, “XML Parser”, XMLReader/XMLWriter, and half a dozen more acronyms I can’t identify. There’s surely some kind of difference between these things and you are free to go figure out what that is.
    • Bindings for two particular credit card processors, SPPLUS and MCVE. What?
    • Three ways to access a MySQL database: mysql, mysqli, and the PDO abstraction thing.

C influence

This deserves its own bullet point, because it’s so absurd yet permeates the language. PHP is a high-level, dynamically-typed programming language. Yet a massive portion of the standard library is still very thin wrappers around C APIs, with the following results:
  • “Out” parameters, even though PHP can return ad-hoc hashes or multiple arguments with little effort.
  • At least a dozen functions for getting the last error from a particular subsystem (see below), even though PHP has had exceptions for eight years.
  • Warts like mysql_real_escape_string, even though it has the same arguments as the broken mysql_escape_string, just because it’s part of the MySQL C API.
  • Global behavior for non-global functionality (like MySQL). Using multiple MySQL connections apparently requires passing a connection handle on every function call.
  • The wrappers are really, really, really thin. For example, calling dba_nextkey without calling dba_firstkey will segfault.
  • There’s a set of ctype_* functions (e.g. ctype_alnum) that map to the C character-class detection functions of similar names, rather than, say, isupper.

Genericism

There is none. If a function might need to do two slightly different things, PHP just has two functions.
How do you sort backwards? In Perl, you might do sort { $b <=> $a }. In Python, you might do .sort(reverse=True). In PHP, there’s a separate function called rsort().
  • Functions that look up a C error: curl_error, json_last_error, openssl_error_string, imap_errors, mysql_error, xml_get_error_code, bzerror, date_get_last_errors, others?
  • Functions that sort: array_multisort, arsort, asort, ksort, krsort, natsort, natcasesort, sort, rsort, uasort, uksort, usort
  • Functions that find text: ereg, eregi, mb_ereg, mb_eregi, preg_match, strstr, strchr, stristr, strrchr, strpos, stripos, strrpos, strripos, mb_strpos, mb_strrpos, plus the variations that do replacements
  • There are a lot of aliases as well, which certainly doesn’t help matters: strstr/strchr, is_int/is_integer/is_long, is_float/is_double, pos/current, sizeof/count, chop/rtrim, implode/join, die/exit, trigger_error/user_error, diskfreespace/disk_free_space
  • scandir returns a list of files within a given directory. Rather than (potentially usefully) return them in directory order, the function returns the files already sorted. And there’s an optional argument to get them in reverse alphabetical order. There were not, apparently, enough sort functions. (PHP 5.4 adds a third value for the sort-direction argument that will disable sorting.)
  • str_split breaks a string into chunks of equal length. chunk_split breaks a string into chunks of equal length, then joins them together with a delimiter.
  • Reading archives requires a separate set of functions depending on the format. There are six separate groups of such functions, all with different APIs, for bzip2, LZF, phar, rar, zip, and gzip/zlib.
  • Because calling a function with an array as its arguments is so awkward (call_user_func_array), there are some pairings like printf/vprintf and sprintf/vsprintf. These do the same things, but one function takes arguments and the other takes an array of arguments.
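To make the complaint concrete, here is a rough Python sketch of the alternative: one sort function whose behavior is chosen by parameters, and argument unpacking instead of a separate array-taking variant. The word list and the describe helper are made up for the example.

```python
words = ["pear", "Apple", "fig", "banana"]

# One function, with parameters, covers what PHP splits across
# sort, rsort, usort, natcasesort, and friends.
ascending = sorted(words)                   # new list, original untouched
descending = sorted(words, reverse=True)    # the "rsort" case
caseless = sorted(words, key=str.lower)     # case-insensitive ordering
by_length = sorted(words, key=len)          # custom key, roughly "usort"

# In-place sorting is a method on the list and takes the same options.
in_place = list(words)
in_place.sort(key=len, reverse=True)


def describe(template, *args):
    """Format a template with positional arguments (hypothetical helper)."""
    return template % args


# Argument unpacking means no separate "v" variant is needed: the same
# function accepts loose arguments or a prebuilt sequence.
loose = describe("%s has %d items", "cart", 3)
packed_args = ("cart", 3)
unpacked = describe("%s has %d items", *packed_args)
```

The point is not that Python is special here; it is that direction, comparison rule, and copy-versus-in-place are independent options rather than a dozen function names.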

Text

  • preg_replace with the /e (eval) flag will do a string replace of the matches into the replacement string, then eval it.
  • strtok is apparently designed after the equivalent C function, which is already a bad idea for various reasons. Never mind that PHP can easily return an array (whereas this is awkward in C), or that the very hack strtok(3) uses (modifying the string in-place) isn’t used here.
  • parse_str parses a query string, with no indication of this in the name. Also it acts just like register_globals and dumps the query into your local scope as variables, unless you pass it an array to populate. (It returns nothing, of course.)
  • explode refuses to split with an empty/missing delimiter. Every other string split implementation anywhere does some useful default in this case; PHP instead has a totally separate function, confusingly called str_split and described as “converting a string to an array”.
  • For formatting dates, there’s strftime, which acts like the C API and respects locale. There’s also date, which has a completely different syntax and only works with English.
  • gzgetss: “Get line from gz-file pointer and strip HTML tags.” I’m dying to know the series of circumstances that led to this function’s conception.
  • mbstring
    • It’s all about “multi-byte”, when the problem is character sets.
    • Still operates on regular strings. Has a single global “default” character set. Some functions allow specifying charset, but then it applies to all arguments and the return value.
    • Provides ereg_* functions, but those are deprecated. preg_* are out of luck, though they can understand UTF-8 by feeding them some PCRE-specific flag.
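As a point of comparison for the parse_str complaint above, this is a small sketch of a query-string parser that returns its result instead of writing variables into the caller's scope. It uses Python 3's urllib.parse; the query string itself is invented.

```python
from urllib.parse import parse_qs

query = "name=eevee&tag=php&tag=python"

# parse_qs returns a new dict mapping each key to a list of values.
# Nothing is written into the caller's local or global namespace.
parsed = parse_qs(query)

name = parsed["name"][0]
tags = parsed["tag"]
```

Repeated keys come back as a list, so nothing is silently overwritten either.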

System and reflection

  • There are, in general, a whole lot of functions that blur the line between text and variables. compact and extract are just the tip of the iceberg.
  • There are several ways to actually be dynamic in PHP, and at a glance there are no obvious differences or relative benefits. classkit can modify user-defined classes; runkit supersedes it and can modify user-defined anything; the Reflection* classes can reflect on most parts of the language; there are a great many individual functions for reporting properties of functions and classes. Are these subsystems independent, related, redundant?
  • get_class($obj) returns the object’s class name. get_class() returns the name of the class the function is being called in. Setting aside that this one function does two radically different things: get_class(null)… acts like the latter. So you can’t trust it on an arbitrary value. Surprise!
  • The stream_* classes allow for implementing custom stream objects for use with fopen and other fileish builtins. “tell” cannot be implemented for internal reasons. (Also there are A LOT of functions involved with this system.)
  • register_tick_function will accept a closure object. unregister_tick_function will not; instead it throws an error complaining that the closure couldn’t be converted to a string.
  • php_uname tells you about the current OS. Unless PHP can’t tell what it’s running on; then it tells you about the OS it was built on. It doesn’t tell you if this has happened.
  • fork and exec are not built in. They come with the pcntl extension, but that isn’t included by default. popen doesn’t provide a pid.
  • stat’s return value is cached.
  • session_decode is for reading an arbitrary PHP session string, but it only works if there’s an active session already. And it dumps the result into $_SESSION, rather than returning it.

Miscellany

  • curl_multi_exec doesn’t change curl_errno on error, but it does change curl_error.
  • mktime’s arguments are, in order: hour, minute, second, month, day, year.

Data manipulation

Programs are nothing more than big machines that chew up data and spit out more data. A great many languages are designed around the kinds of data they manipulate, from awk to Prolog to C. If a language can’t handle data, it can’t do anything.

Numbers

  • Integers are signed and 32-bit on 32-bit platforms. Unlike all of PHP’s contemporaries, there is no automatic bigint promotion. So you can end up with surprises like negative file sizes, and your math might work differently based on CPU architecture. Your only option for larger integers is to use the GMP or BC wrapper functions. (The developers have proposed adding a new, separate, 64-bit type. This is crazy.)
  • PHP supports octal syntax with a leading 0, so e.g. 012 will be the number ten. However, 08 becomes the number zero. The 8 (or 9) and any following digits disappear. 01c is a syntax error.
  • 0x0+2 produces 4. The parser considers the 2 as both part of the hex literal and a separate decimal literal, treating this as 0x002 + 2. 0x0+0x2 displays the same problem. Strangely, 0x0 +2 is still 4, but 0x0+ 2 is correctly 2. (This is fixed in PHP 5.4. But it’s also re-broken in PHP 5.4, with the new 0b literal prefix: 0b0+1 produces 2.)
  • pi is a function. Or there’s a constant, M_PI.
  • There is no exponentiation operator, only the pow function.
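The “negative file sizes” surprise is easy to reproduce by hand. The sketch below only shows what happens when a value is squeezed into a signed 32-bit slot; it does not try to model PHP's own arithmetic rules, and as_signed_32 is a helper written for this example.

```python
def as_signed_32(value):
    """Reinterpret an integer as a signed 32-bit two's-complement value."""
    value &= 0xFFFFFFFF          # keep only the low 32 bits
    if value >= 0x80000000:      # high bit set means negative
        value -= 0x100000000
    return value


three_gib = 3 * 1024 ** 3        # size in bytes of a hypothetical 3 GiB file

# Python integers grow as needed, so the true size is representable.
true_size = three_gib

# Squeezed into a signed 32-bit slot, the same size comes out negative.
wrapped_size = as_signed_32(three_gib)
```

Anything up to 2**31 - 1 survives the round trip unchanged, which is why the problem tends to show up only once files or counters get large.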

Text

  • No Unicode support. Only ASCII will work reliably, really. There’s the mbstring extension, mentioned above, but it kinda blows.
  • Which means that using the builtin string functions on UTF-8 text risks corrupting it.
  • Similarly, there’s no concept of e.g. case comparisons outside of ASCII. Despite the proliferation of case-insensitive versions of functions, not one of them will consider é equal to É.
  • You can’t quote keys in variable interpolation, i.e., "$foo['key']" is a syntax error. You can unquote it (which would generate a warning anywhere else!), or use ${...}/{$...}.
  • "${foo[0]}" is okay. "${foo[0][0]}" is a syntax error. Putting the $ on the inside is fine with both. Bad copy of similar Perl syntax (with radically different semantics)?
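The difference between bytes and characters is the root of the Unicode complaints above. Here is a short sketch, using Python 3 strings, of how byte-oriented operations go wrong on UTF-8 text; the sample words are arbitrary.

```python
text = "\u00e9"                    # "é", one character
raw = text.encode("utf-8")         # two bytes: 0xC3 0xA9

char_count = len(text)             # counts characters
byte_count = len(raw)              # counts bytes

# Cutting at a byte offset, as a byte-oriented string function would,
# can split a character in half and leave invalid UTF-8 behind.
truncated = "caf\u00e9".encode("utf-8")[:4]   # b"caf" plus half of "é"
try:
    truncated.decode("utf-8")
    survived = True
except UnicodeDecodeError:
    survived = False

# ASCII-only case folding leaves "é" and "É" unequal;
# Unicode-aware folding treats them as the same letter.
ascii_only_equal = raw.lower() == "\u00c9".encode("utf-8").lower()
unicode_equal = "\u00e9".casefold() == "\u00c9".casefold()
```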

Arrays

Oh, man.
  • This one datatype acts as a list, ordered hash, ordered set, sparse list, and occasionally some strange combination of those. How does it perform? What kind of memory use will there be? Who knows? Not like I have other options, anyway.
  • => isn’t an operator. It’s a special construct that only exists inside array(...) and the foreach construct.
  • Negative indexing doesn’t work, since -1 is just as valid a key as 0.
  • Despite that this is the language’s only data structure, there is no shortcut syntax for it; array(...) is shortcut syntax. (PHP 5.4 is bringing “literals”, [...].)
  • The => construct is based on Perl, which allows foo => 1 without quoting. (That is, in fact, why it exists in Perl; otherwise it’s just a comma.) In PHP, you can’t do this without getting a warning; it’s the only language in its niche that has no vetted way to create a hash without quoting string keys.
  • Array functions often have confusing or inconsistent behavior because they have to operate on lists, hashes, or maybe a combination of the two. Consider array_diff, which “computes the difference of arrays”.
      $first  = array("foo" => 123, "bar" => 456);
      $second = array("foo" => 456, "bar" => 123);
      var_dump(array_diff($first, $second));
    
    What will this code do? If array_diff treats its arguments as hashes, then obviously these are different; the same keys have different values. If it treats them as lists, then they’re still different; the values are in the wrong order.
    In fact array_diff considers these equal, because it treats them like sets: it compares only values, and ignores order.
  • In a similar vein, array_rand has the strange behavior of selecting random keys, which is not that helpful for the most common case of needing to pick from a list of choices.
  • Despite how heavily PHP code relies on preserving key order:
      array("foo", "bar") != array("bar", "foo")
      array("foo" => 1, "bar" => 2) == array("bar" => 2, "foo" => 1)
    
    I leave it to the reader to figure out what happens if the arrays are mixed. (I don’t know.)
  • array_fill cannot create zero-length arrays; instead it will issue a warning and return false.
  • All of the (many…) sort functions operate in-place and return nothing. There is no way to create a new sorted copy; you have to copy the array yourself, then sort it, then use the array.
  • But array_reverse returns a new array.
  • A list of ordered things and some mapping of keys to values sounds kind of like a great way to handle function arguments, but no.
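The array_diff example above has three plausible readings, and with separate list, hash, and set types each reading is a different, explicit operation. This is a sketch in Python using the same values, not a reimplementation of array_diff.

```python
first = {"foo": 123, "bar": 456}
second = {"foo": 456, "bar": 123}

# As hashes: same keys, different values, so they differ.
differ_as_dicts = first != second

# As lists of values in insertion order: same items, different order.
differ_as_lists = list(first.values()) != list(second.values())

# As sets of values: order and keys are ignored, so nothing is left over.
leftover_as_sets = set(first.values()) - set(second.values())

# With separate types, the ordering rules follow from the type:
lists_ordered = ["foo", "bar"] != ["bar", "foo"]
dicts_unordered = {"foo": 1, "bar": 2} == {"bar": 2, "foo": 1}

# A sorted copy is one call; the original list is left alone.
scores = [3, 1, 2]
ranked = sorted(scores)
```

The set reading is the one that matches the array_diff behavior described above; the other two are what most people would guess first.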

Not arrays

  • The standard library includes “Quickhash”, an OO implementation of “specific strongly-typed classes” for implementing hashes. And, indeed, there are four classes, each dealing with a different combination of key and value types. It’s unclear why the builtin array implementation can’t optimize for these extremely common cases, or what the relative performance is.
  • There’s an ArrayObject class (which implements five different interfaces) that can wrap an array and have it act like an object. User classes can implement the same interfaces. But it only has a handful of methods, half of which don’t resemble built-in array functions, and built-in array functions don’t know how to operate on an ArrayObject or other array-like class.

Functions

  • Functions are not data. Closures are actually objects, but regular functions are not. You can’t even refer to them with their bare names; var_dump(strstr) issues a warning and assumes you mean the literal string, "strstr". There is no way to discern between an arbitrary string and a function “reference”.
  • create_function is basically a wrapper around eval. It creates a function with a regular name and installs it globally (so it will never be garbage collected—don’t use in a loop!). It doesn’t actually know anything about the current scope, so it’s not a closure. The name contains a NUL byte so it can never conflict with a regular function (because PHP’s parser fails if there’s a NUL in a file anywhere).
  • Declaring a function named __lambda_func will break create_function: the actual implementation is to eval-create the function named __lambda_func, then internally rename it to the broken name. If __lambda_func already exists, the first part will throw a fatal error.
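For readers who have only seen the string-based approach, here is roughly what “functions are data” and “closure” mean in practice. It is a Python sketch; shout and make_counter are names invented for the example.

```python
def shout(text):
    return text.upper()


# A function is an ordinary value: it can be stored, passed, and inspected.
handlers = {"loud": shout, "length": len}
is_function = callable(shout)
is_string_callable = callable("shout")     # a string is just a string


def make_counter(start):
    """Return a closure that remembers and updates its own count."""
    count = [start]                        # mutable cell captured by the closure

    def step():
        count[0] += 1
        return count[0]

    return step


counter = make_counter(10)
first_value = counter()
second_value = counter()
applied = handlers["loud"]("hello")
```

There is never any doubt about whether a value is a function or merely a string that happens to spell one, and the inner function sees the scope it was defined in.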

Other

  • Incrementing (++) a NULL produces 1. Decrementing (--) a NULL produces NULL. Decrementing a string likewise leaves it unchanged.
  • There are no generators.

Web framework

Execution

  • A single shared file, php.ini, controls massive parts of PHP’s functionality and introduces complex rules regarding what overrides what and when. PHP software that expects to be deployed on arbitrary machines has to override settings anyway to normalize its environment, which largely defeats the use of a mechanism like php.ini anyway.
    • PHP looks for php.ini in a variety of places, so it may (or may not…) be possible to override your host’s. Only one such file will ever be parsed, though, so you can’t just override a couple settings and call it a day.
  • PHP basically runs as CGI. Every time a page is hit, PHP recompiles the whole thing before executing it. Even dev servers for Python toy frameworks don’t act like this.
    This has led to a whole market of “PHP accelerators” that just compile once, accelerating PHP all the way to any other language. Zend, the company behind PHP, has made this part of their business model.
  • For quite a long time, PHP errors went to the client by default—I guess to help during development. I don’t think this is true any more, but I still see the occasional mysql error spew at the top of a page.
  • PHP is full of strange “easter eggs” like producing the PHP logo with the right query argument. Not only is this completely irrelevant to building your application, but it allows detecting whether you’re using PHP (and perhaps roughly guessing what version), regardless of how much mod_rewrite, FastCGI, reverse proxying, or Server: configuration you’re doing.
  • Blank lines before or after the <?php ... ?> tags, even in libraries, count as literal text and are interpolated into the response (or cause “headers already sent” errors). Your options are to either strictly avoid extra blank lines at the end of every file (the one after the ?> doesn’t count) or to just leave off the ?> closing token.

Deployment

Deployment is often cited as the biggest advantage of PHP: drop some files and you’re done. Indeed, that’s much easier than running a whole process as you may have to do with Python or Ruby or Perl. But PHP leaves plenty to be desired.
Across the board, I’m in favor of running Web applications as app servers and reverse-proxying to them. It takes minimal effort to set this up, and the benefits are plenty: you can manage your web server and app separately, you can run as many or few app processes on as many machines as you want without needing more web servers, you can run the app as a different user with zero effort, you can switch web servers, you can take down the app without touching the web server, you can do seamless deployment by just switching where a fifo points, etc. Welding your application to your web server is absurd and there’s no good reason to do it any more.
  • PHP is naturally tied to Apache. Running it separately, or with any other webserver, requires just as much mucking around (possibly more) as deploying any other language.
  • php.ini applies to every PHP application run anywhere. There is only one php.ini file, and it applies globally; if you’re on a shared server and need to change it, or if you run two applications that need different settings, you’re out of luck; you have to apply the union of all necessary settings and pare them down from inside the apps themselves using ini_set or in Apache’s configuration file or in .htaccess. If you can. Also wow that is a lot of places you need to check to figure out how a setting is getting its value.
  • Similarly, there is no easy way to “insulate” a PHP application and its dependencies from the rest of a system. Running two applications that require different versions of a library, or even PHP itself? Start by building a second copy of Apache.
  • The “bunch of files” approach, besides making routing a huge pain in the ass, also means you have to carefully whitelist or blacklist what stuff is actually available, because your URL hierarchy is also your entire code tree. Configuration files and other “partials” need C-like guards to prevent them from being loaded directly. Version control noise (e.g., .svn) needs protecting. With mod_php, everything on your filesystem is a potential entry point; with an app server, there’s only one entry point, and only the URL controls whether it’s invoked.
  • You can’t seamlessly upgrade a bunch of files that run CGI-style, unless you want crashes and undefined behavior as users hit your site halfway through the upgrade.
  • Despite how “simple” it is to configure Apache to run PHP, there are some subtle traps even there. While the PHP docs suggest using SetHandler to make .php files run as PHP, AddHandler appears to work just as well, and in fact Google gives me twice as many results for it. Here’s the problem.
    When you use AddHandler, you are telling Apache that “execute this as php” is one possible way to handle .php files. But! Apache doesn’t have the same idea of file extensions that every human being on the planet does. It’s designed to support, say, index.html.en being recognized as both English and HTML. To Apache, a file can have any number of file extensions simultaneously.
    Imagine you have a file upload form that dumps files into some public directory. To make sure nobody uploads PHP files, you just check that they don’t have a .php extension. All an attacker has to do is upload a file named foo.php.txt; your uploader won’t see a problem, but Apache will recognize it as PHP, and it will happily execute.
    The problem here isn’t “using the original filename” or “not validating better”; the problem is that your web server is configured to run any old code it runs across—precisely the same property that makes PHP “easy to deploy”. CGI required +x, which was something, but PHP doesn’t even do that. And this is no theoretical problem; I’ve found multiple live sites with this issue.
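The gap between the uploader's check and the server's view of the filename can be sketched in a few lines. This is a deliberately simplified model of the behavior described above, not Apache's actual matching rules, and all three function names are hypothetical.

```python
def naive_is_php(filename):
    """The check from the upload form: look only at the final extension."""
    return filename.lower().endswith(".php")


def extensions(filename):
    """Every dot-separated suffix after the base name, lowercased."""
    return [part.lower() for part in filename.split(".")[1:]]


def multi_extension_is_php(filename):
    """Simplified model: any extension in the name can select a handler."""
    return "php" in extensions(filename)
```

For foo.php.txt the first check says “not PHP” while the second says “PHP”, and that disagreement is the entire vulnerability.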

Missing features

I consider all of these to be varying levels of critical for building a Web application. It seems reasonable that PHP, with its major selling point being that it’s a “Web language”, ought to have some of them.
  • No template system. There’s PHP itself, but nothing that acts as a big interpolator rather than a program.
  • No XSS filter. No, “remember to use htmlspecialchars” is not an XSS filter. This is.
  • No CSRF protection. You get to do it yourself.
  • No generic standard database API. Stuff like PDO has to wrap every individual database’s API to abstract the differences away.
  • No routing. Your website looks exactly like your filesystem. Many developers have been tricked into thinking mod_rewrite (and .htaccess in general) is an acceptable substitute.
  • No authentication or authorization.
  • No dev server. (“Fixed” in 5.4. Led to the Content-Length vuln below. Also, you have to port all your rewrite rules to a PHP wrapper thing, because there’s no routing.)
  • No interactive debugging.
  • No coherent deployment mechanism; only “copy all these files to the server”.

Security

Language boundaries

PHP’s poor security reputation is largely because it will take arbitrary data from one language and dump it into another. This is a bad idea. "<script>" may not mean anything in SQL, but it sure does in HTML.
Making this worse is the common cry for “sanitizing your inputs”. That’s completely wrong; you can’t wave a magic wand to make a chunk of data inherently “clean”. What you need to do is speak the language: use placeholders with SQL, use argument lists when spawning processes, etc.
  • PHP outright encourages “sanitizing”: there’s an entire data filtering extension for doing it.
  • All the addslashes, stripslashes, and other slashes-related nonsense are red herrings that don’t help anything.
  • There is, as far as I can tell, no way to safely spawn a process. You can ONLY execute a string via the shell. Your options are to escape like crazy and hope the default shell uses the right escaping, or pcntl_fork and pcntl_exec manually.
  • Both escapeshellcmd and escapeshellarg exist with roughly similar descriptions. Note that on Windows, escapeshellarg does not work (because it assumes Bourne shell semantics), and escapeshellcmd just replaces a bunch of punctuation with spaces because nobody can figure out Windows cmd escaping (which may silently wreck whatever you’re trying to do).
  • The original built-in MySQL bindings, still widely-used, have no way to create prepared statements.
To this day, the PHP documentation on SQL injection recommends batty practices like type-checking, using sprintf and is_numeric, manually using mysql_real_escape_string everywhere, or manually using addslashes everywhere (which “may be useful”!). There is no mention of PDO or parameterization, except in the user comments. I complained about this very specifically to a PHP dev at least two years ago, he was alarmed, and the page has never changed.
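“Speaking the language” looks something like the following sketch, which uses Python's standard sqlite3 and subprocess modules as stand-ins for whatever database and external program you actually have. The hostile string is made up, and the child process is just the Python interpreter echoing its first argument.

```python
import sqlite3
import subprocess
import sys

hostile = "Robert'); DROP TABLE users; --"

# SQL: the statement and the data travel separately. The "?" placeholder
# is filled in by the driver, so the value is never parsed as SQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT)")
db.execute("INSERT INTO users (name) VALUES (?)", (hostile,))
stored = db.execute("SELECT name FROM users").fetchone()[0]

# Processes: an argument list goes to the child as-is, with no shell
# in between to reinterpret quotes, semicolons, or spaces.
echo_script = "import sys; sys.stdout.write(sys.argv[1])"
result = subprocess.run(
    [sys.executable, "-c", echo_script, hostile],
    stdout=subprocess.PIPE,
    universal_newlines=True,
    check=True,
)
echoed = result.stdout
```

In both cases the hostile value arrives byte for byte, with no escaping step that could be forgotten or done wrong.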

Insecure-by-default

  • register_globals. It’s been off by default for a while by now, and it’s gone in 5.4. I don’t care. This is an embarrassment.
  • include accepting HTTP URLs. Likewise.
  • Magic quotes. So close to secure-by-default, and yet so far from understanding the concept at all. And, likewise.

Core

The PHP interpreter itself has had some fascinating security problems.
  • In 2007 the interpreter had an integer overflow vulnerability. The fix started with if (size > INT_MAX) return NULL; and went downhill from there. (For those not down with the C: INT_MAX is the biggest integer that will fit in a variable, ever. I hope you can figure out the rest from there.)
  • More recently, PHP 5.3.7 managed to include a crypt() function that would, in effect, let anyone log in with any password.
  • PHP 5.4’s dev server is vulnerable to a denial of service, because it takes the Content-Length header (which anyone can set to anything) and tries to allocate that much memory. This is a bad idea.
I could dig up more but the point isn’t that there are X many exploits—software has bugs, it happens, whatever. The nature of these is horrifying. And I didn’t seek these out; they just happened to land on my doorstep in the last few months.

Conclusion

Some commentary has rightfully pointed out that I don’t have a conclusion. And, well, I don’t have a conclusion. If you got all the way down here, I assumed you agreed with me before you started :)
If you only know PHP and you’re curious to learn something else, give the Python tutorial a whirl and try Flask for the web stuff. (I’m not a huge fan of its template language, but it does the job.) It breaks apart the pieces of your app, but they’re still the same pieces and should look familiar enough. I might write a real post about this later; a whirlwind introduction to an entire language and web stack doesn’t belong down here.
Later or for bigger projects you may want Pyramid, which is medium-level, or Django, which is a complex monstrosity that works well for building sites like Django’s.
If you’re not a developer at all but still read this for some reason, I will not be happy until everyone on the planet has gone through Learn Python The Hard Way so go do that.
There’s also Ruby with Rails and some competitors I’ve never used, and Perl is still alive and kicking with Catalyst. Read things, learn things, build things, go nuts.

Python FAQ: Webdev



I only know PHP. How do I write a Web application in Python?
This is a deeply complex question. I could easily fill a book on web development and Python and how to make the two interact, so I was hoping to put this one off for a while. But given that I just trashed PHP rather harshly, it seems prudent to answer it sooner rather than later.
The dead simple answer is to stop reading here, get Flask, and start building a thing. I prefer a bit more nuance, though.
This is not a tutorial. I may write one in the future, but for now, plenty of them already exist, and I assume you can read documentation. Instead, this is an overview of the current state of affairs for someone new to Python web development.

Getting started

Obviously, you’ll need to have Python installed. Be sure to use Python 2, not 3; Python 3 made some backwards-incompatible changes, and not all libraries have updated yet.
For installing Python libraries, consider pip. (If you’re on a Unixlike, you can probably get it from your package manager, or with easy_install pip.) pip is a little package manager for Python; it can easily install, remove, upgrade, and examine Python libraries. Use your system package manager whenever possible, of course, but use pip for everything else.
You can install Python libraries to your home directory with pip install --user ..., but it’s even better to keep libraries local to each project you work on; that way, you can upgrade dependencies for one project without potentially breaking all the others. (Or breaking system software written in Python. I have done this.) virtualenv helps with this by creating a separate Python installation with a single command.
And, of course, you’re already planning to use source control. Right? I like git, but anything is better than nothing at all.

Framework

The first hurdle is somehow connecting your code to a browser. In PHP, the simplest thing is to install Apache and point it at some files. In Python, as with larger PHP projects, you’ll generally do this with a web framework.
Frameworks all tend to have a similar workflow:
  1. Install it, with a tool like pip.
  2. Create a skeleton project.
    The complexity of the skeleton varies. In the now-defunct Pylons, you’d end up with a good chunk of somewhat-mysterious code that you had to manually upgrade for new releases. Flask is so simple that there is no skeleton. Somewhere in the middle is Pyramid, where a skeleton project is nothing more than some common boilerplate that you’d end up writing yourself if you started from scratch.
  3. Configure some things, like databases.
  4. Start up the development server.
    This tends to be a terminal program that runs your app without need for a dedicated HTTP server. It’ll reload your code when it changes, and spit out stack traces and other debugging information.
  5. Hack away!
What, then, should you use? There are zillions of options, but a few that are clearly the most popular.
I’m a fan of Pyramid, which hits a sweet spot somewhere between minimalism and batteries-included monolith. It’s a somewhat recent contender, but it evolved from two older and well-established projects; the result is well-designed, well-documented, and fairly transparent. A simple app needs no automatic boilerplate at all, there are skeletons to get you up and running, and the core library is overflowing with extension points. There’s a growing collection of helpful addons, as well.
For an even quicker start, Flask is about as simple as it gets, but has plenty of room to grow with crazy amounts of extensibility if you’re willing to build on top of it. It’s designed to do fairly reasonable things out of the box, without forcing much on you.
Bottle is similar to Flask, though arguably even simpler: it’s distributed as a single file and has zero dependencies. Whether this is good or bad is up to you, but it does mean that nothing in Bottle will be shared with any other framework, ever. Admittedly I don’t know too much about it, but I gave it a brief shot once and didn’t have any major complaints.
On the other end of the spectrum, Django is a massive beast designed for CMS-likes and other content-rich sites. It has a large ecosystem of pluggable components, built-in everythings from templates to an ORM, and piles of documentation and community resources. Django is generally cited as the Python equivalent to Ruby on Rails. The downside is that convincing it to do things it doesn’t want to do can be… awkward. (Many of the more obtuse questions in #python are caused by attempts to tinker with Django.) Possibly a little heavy for a first attempt.
web2py exists. I, er, don’t know much else about it. Allegedly it injects variables into your modules’ namespaces, and that’s gross, so don’t use it if you care what I think is gross. Or do. Whatever.
There used to be a mod_python Apache module that was spiritually similar to mod_perl, but it’s long since been abandoned. Please do not use it.
Lastly, you can write Python web code “manually”, but that’s largely an exercise in frustration. It’s not faster, it’s not educational, it’s not really very useful. Don’t bother.
My suggestion? If you just want to tinker, start with Flask and add on stuff as you go. If you have an idea for a site in mind and want to hit the ground running, use a Pyramid scaffold and follow along with its narrative documentation.

Routing

While PHP executes an entire file based on the URL, Python web applications tend to “own” an entire directory structure (or even the entire domain). Connecting particular URLs to particular code is thus a bit more flexible, and is usually handled by a routing system.
Routes are URLs with optional placeholders, like these:
/users/{name}
/companies/{id}/products
/blog/{year:\d\d\d\d}/{month:\d\d}/{day:\d\d}/{title}
You’d attach a route like this to a function. Then when you browse to /users/eevee, that function would be run, and the placeholders would be available in a structure like dict(name=u'eevee').
Some frameworks (like Pyramid) take this a step further: instead of attaching a route directly to a function, you give the route a name, and then attach the name to the function. It’s a little extra work, but the advantage is a central list of every page in your app. You can also use a route name and placeholder values to generate a URL—then, later, you can change a URL in a single place without touching anything else, and a typo in development will cause an error instead of a 404.
The syntax and exact implementation vary a little, but every framework uses some variation of this system. Some have helpers for creating RESTful routes or other common patterns, or you can easily write your own.

Request cycle

An HTTP request tends to run a function somewhere (chosen by the router) and pass it a request object.
The request object’s exact interface will depend on the particular framework, but they tend to be similar: parsed query data, cookies, headers, and so forth. As an example, take webob’s Request object, which includes:
  • request.GET and request.POST are “multidicts” of parsed query data. (A multidict returns a single value for request.GET['foo'], but exposes all values with a getall() method.)
  • request.params is a multidict combining both of the above.
  • request.cookies is a parsed dict of cookies.
  • request.headers is a dict of HTTP request headers, but with the keys treated as case-insensitive.
  • request.is_xhr returns whether the X-Requested-With: XMLHttpRequest header is present, to identify ajax requests from libraries like jQuery that set it.
Request objects tend to be pretty thoroughly documented, so just have a flip through the docs of your chosen framework and pick out the important stuff.
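If "multidict" sounds mysterious, this stdlib-only toy shows the general shape. It is an assumption-laden sketch rather than webob's implementation: TinyMultiDict is a made-up class, and returning the last value from [] is an arbitrary choice here, so check your framework's docs for what its multidict does.

```python
try:
    from urllib.parse import parse_qsl   # Python 3
except ImportError:
    from urlparse import parse_qsl       # Python 2


class TinyMultiDict(object):
    """Keeps every (key, value) pair, even when a key repeats."""

    def __init__(self, pairs):
        self._pairs = list(pairs)

    def __getitem__(self, key):
        # Return a single value; this toy picks the last one given.
        for k, v in reversed(self._pairs):
            if k == key:
                return v
        raise KeyError(key)

    def getall(self, key):
        # Return every value given for the key, in order.
        return [v for k, v in self._pairs if k == key]


query = TinyMultiDict(parse_qsl('tag=python&tag=web&page=2'))
print(query['tag'])
print(query.getall('tag'))
```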
When your app is done doing whatever cool thing it does, you send back a response. You usually have the option of either explicitly constructing a Response object (including HTTP headers and other manual fiddling) or just returning a chunk of HTML and using the defaults for everything else. You very rarely need to create a response yourself; for common cases like returning JSON, every framework has some shortcut or helper decorator.

Templates

Assembling HTML tends to be done with a template engine. The two major contenders are Mako and Jinja2.
I really like Mako. Really, really, really. Go use it. It uses unadorned Python for its syntax, and manages to do so in a very natural way. You can even write blocks of pure Python within templates, though of course you should exercise restraint and do this as little as possible. :)
Jinja2 is okay, with a gruff caveat: foo.bar is treated as foo['bar'] if foo looks like a dict and vice versa. I happen to think this is a really bad idea, and have been bitten by numerous subtle problems it causes in multiple template systems with the same “feature”. (Also, the {% %} syntax is really noisy, but that’s splitting hairs.) That aside, Jinja2 is a plenty solid library and you could definitely do worse. Much, much worse.
Both of these tools are pretty speedy, automatically compile to Python modules behind the scenes, have excellent debugging (with crazy hacks to get stack traces from your original template source), and should be plenty powerful enough to do whatever you want. Have a glance over both, pick one, and get going. If you don’t know or care which to use, use Mako.
(Note that while Flask uses Jinja2 by default, it’s fairly easy to use Mako instead.)
There are some other contenders, of course: the third-place winner is probably Genshi, but it’s so incredibly convoluted that the homepage starts off with a flow diagram; Django has its own template engine that tries very hard to keep logic out of your templates (imo to its detriment); Bottle likewise has its own drop-dead-simple templates that will probably cause growing pains pretty quickly; Pyramid’s other builtin template engine is Chameleon, which uses HTML-ish attributes for loops and other logic, and that’s fucking batty.
Maybe you’ll like one of these; I haven’t used them non-trivially myself.
Whatever you do, do not use Cheetah. DO NOT use Cheetah. It is an unholy abomination. Let’s not speak of it further.

Logic in templates

Perhaps you haven’t used templates before. If so, you’ll inevitably run into the question of whether some complex rendering code should live in Python or live in your template.
This is an old and silly argument, but I will say this: like many architectural decisions in programming, it comes down to minimizing how much you’ll hate yourself for it later. Keep complexity out of your templates if you can, but don’t jump through hoops to avoid it if you can’t. Remember that you can always just write plain Python functions in plain Python modules and import them. A powerful template language might even have a creative solution to your problem built in; glance over the docs again while you’re thinking.

Unicode

Unicode sucks. This is a universal truth. (I’m lying. Dealing with encodings sucks. Unicode is great. It’s complicated. I’ll write about it later.)
Python (2) has two “string” types: str and unicode. There’s a clever lie here, and that is: a str isn’t really a string. It’s just a series of bytes. Sometimes that happens to look like a string, but it’s really just a binary representation, the same way 85 00 00 00 is a common binary representation of the number 133. A real number is an int, and a real string is a unicode.
The issue is complicated enough to deserve its own article (which I will totally write sooner or later), but some quick notes:
  • Your program should only worry about real strings (that is, unicodes) internally. You have to decode strings that come into your program and encode ones that leave, but luckily, most web frameworks will do that for you.
  • You can use the u prefix on a literal string to make it a unicode, e.g., u'foo'.
  • You can use from __future__ import unicode_literals at the top of a file to make all literal strings within that file be unicode by default. If you really really want a str, use a b prefix.
  • If you want to use non-ASCII characters in Python source code, add an #encoding: utf8 magic comment to the top. (Assuming of course that your source code is saved as UTF-8, which it had damn well better be.)
  • NEVER solve a Unicode problem by stripping out non-ASCII characters! That’s incredibly rude to a huge number of people; just imagine how you’d feel trying to use a website that decided you weren’t allowed to use English letters, because some programmer was too lazy to figure out how to handle them.
  • In fact, accented letters and Asian characters are great for shaking out encoding problems. Paste some non-ASCII gibberish into forms on your site and see what happens.
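The decode-on-the-way-in, encode-on-the-way-out rule looks roughly like this when done by hand. The byte string is just an example value; in a real app the framework usually does both ends for you.

```python
raw = b'caf\xc3\xa9'            # bytes as they might arrive from the network (UTF-8)
text = raw.decode('utf-8')      # decode once, at the boundary
shouted = text.upper()          # work with real text in between
out = shouted.encode('utf-8')   # encode once, on the way out

# Five bytes, but only four characters: the last character takes two bytes.
print(len(raw), len(text))
```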

XSS

Virtually everything nowadays has some form of automatic HTML escaping filter built in. The idea is that a template like this:
<p>Hello, ${name}!</p>
will, when given name = '<b>', safely print out Hello, &lt;b&gt;!. This means that, for the most part, you don’t have to worry about XSS.
For the most part. If nothing else, you must check the docs for your framework and template engine to make sure this is turned on by default, or turn it on if it’s not. (Off the top of my head: you get it for free with Pyramid, Django, and Flask. Bottle does it automatically if your template file has an HTML-sounding extension.)
The tricky bit, then, is knowing when and how to turn it off. If you construct some complex HTML in Python code, you don’t want it all escaped when sticking it in your template. Merely disabling the escaping behavior is a crappy solution, though; anywhere you do it is a potential source of injection. Luckily, many frameworks (Pyramid and Flask, at least) use the markupsafe library, which cleverly helps avoid this problem.
markupsafe provides a single class, Markup, which inherits from unicode. Markup(u'Hello!') will produce an object that acts pretty much like a string. The classmethod Markup.escape works the same way, but it escapes any HTML characters in the wrapped string.
There are two sneaky tricks here. The first is that a Markup object will never be escaped a second time. Observe:
>>> from markupsafe import Markup
>>> s = u'<b>oh noo xss</b>'
>>> Markup.escape(s)
Markup(u'&lt;b&gt;oh noo xss&lt;/b&gt;')
>>> Markup.escape(Markup.escape(s))
Markup(u'&lt;b&gt;oh noo xss&lt;/b&gt;')
So once you’ve created a Markup, you can feed it to your template, and the filter will leave it alone—even if it contains HTML.
The other trick is that Markup overrides every string method and automatically escapes all the arguments. That means you can do stuff like this in Python land:
>>> user_input = u'<script>alert("pwn");</script>'
>>> Markup(u'<p>Hello, %s!</p>') % user_input
Markup(u'<p>Hello, &lt;script&gt;alert(&#34;pwn&#34;);&lt;/script&gt;!</p>')
You can thus build complex HTML fairly safely, without worrying too much about underescaping or overescaping.
It’s not perfect, of course; the primary gotcha is that you need to use Markup().join(...) on a sequence of other Markup objects, not ''.join(...). And some operations like slicing, splitting, and regexes are likely to produce nonsensical results. Never try to decompose a Markup object or any other string of HTML; if you absolutely must, use a real parser like lxml, but in most cases you can do whatever transformation you need on a plain string before wrapping it in HTML.
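If you're curious how a string subclass can pull off both tricks, here is a toy version built on the standard library. To be clear, this is not markupsafe's code: ToyMarkup is a made-up class that only handles % with positional arguments and join, and the stdlib escaper spells quotes as &quot; rather than &#34;.

```python
try:
    from html import escape as _escape   # Python 3
except ImportError:
    from cgi import escape as _escape    # Python 2

try:
    text_type = unicode   # Python 2
except NameError:
    text_type = str       # Python 3


class ToyMarkup(text_type):
    """A string that is assumed to already be safe HTML."""

    @classmethod
    def escape(cls, value):
        # Trick one: something that is already markup is never escaped again.
        if isinstance(value, cls):
            return value
        return cls(_escape(text_type(value), quote=True))

    def __mod__(self, args):
        # Trick two: arguments are escaped before being interpolated.
        if not isinstance(args, tuple):
            args = (args,)
        escaped = tuple(self.escape(arg) for arg in args)
        return self.__class__(text_type.__mod__(self, escaped))

    def join(self, items):
        return self.__class__(
            text_type.join(self, [self.escape(item) for item in items]))


user_input = u'<script>alert("pwn");</script>'
greeting = ToyMarkup(u'<p>Hello, %s!</p>') % user_input
print(greeting)
```

Use the real library in actual code; it covers the many string methods this sketch ignores.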

Forms

I hate all form handling libraries. Every single one. They all enforce the author’s crazy naming scheme on my forms. I don’t even like the PHP behavior of using foo[] as a field name; that’s just so astoundingly ugly.
The one I hate the least so far is wtforms; it enforces fairly few design constraints and is pretty simple to use. It even has built-in support for working with markupsafe. The major downsides are that it’s difficult to remove those few design constraints (every form element gets an id attribute matching its name—argh!), and implementing a new kind of field can be a little complex.
I can’t speak much to any others, alas. FormEncode is a thing. Pyramid’s maintainers also own deform. They both do some dumb thing that bothers me for really nitpicky reasons. Shop around.
Whatever you do, just make sure you use something before your project grows too big. The one thing I hate more than form handling libraries is writing validation code by hand. :)

“Sanitizing”

A note on a common trend in PHP land.
Do not “sanitize”.
The word itself makes no sense. There is no process by which you can take an arbitrary string and make it “safe”. This kind of thinking is why I keep running into bank websites with contact forms that tell me I can’t use the < character; some numbskull enterprise developer doesn’t have a clue how to deal with data, so he just enforces that all data must be idiot-proof.
Don’t be an idiot.
Most of the time, “sanitizing” is referring to making user input “safe” to embed in HTML, pass to SQL, or use as a command-line argument. You can do all of these things without changing the original data at all. For HTML, there are filters like markupsafe, mentioned above. For SQL, there are bound parameters and ORMs. For running commands, you should avoid the shell entirely and just pass the arguments as a list (see subprocess).
These are all problems of language barriers: HTML, SQL, and shell are all structured languages, and you can’t just dump mystery junk into them and hope for the best. You wouldn’t use string concatenation to create JSON, so don’t do it to run convert. Use tools that understand the underlying structure.
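A quick stdlib-only demonstration of "use tools that understand the structure": the same hostile string goes through SQL, a subprocess, and JSON and comes out byte-for-byte unchanged. The table name and the string are invented for the example, and sqlite3 stands in for whatever database you actually use.

```python
import json
import sqlite3
import subprocess
import sys

hostile = u"Robert'); DROP TABLE students; -- <b>&"

# SQL: a bound parameter (the ?) keeps data separate from the query text.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE students (name TEXT)')
conn.execute('INSERT INTO students (name) VALUES (?)', (hostile,))
stored = conn.execute('SELECT name FROM students').fetchone()[0]

# Commands: pass a list of arguments, so no shell ever parses the string.
echoed = subprocess.check_output(
    [sys.executable, '-c', 'import sys; sys.stdout.write(sys.argv[1])', hostile])

# JSON: let the library build it rather than concatenating strings.
round_tripped = json.loads(json.dumps({'name': hostile}))['name']

print(stored == hostile, round_tripped == hostile)
```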
This isn’t to say that you should never modify or filter user input, but you should avoid it whenever possible, and be damn careful when you do. Take passwords as a common example: why is it so common to prohibit spaces in them or limit them to 16 characters? There’s no clear reason; it’s just a thing that’s done.
I’m still baffled by this one: the same places that cry foul when I try to type a < also insist that I type my credit card number as a solid string of sixteen digits. That makes it really hard to verify at a glance that I typed it correctly—and besides, the number on my card has spaces in it! Why not strip spaces and hyphens?
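Accepting the number the way it's printed on the card takes about three lines. This is a sketch with a made-up function name, and it only normalizes the input; it makes no attempt at real card validation.

```python
def normalize_card_number(text):
    """Accept spaces and hyphens as typed, and return just the digits."""
    digits = text.replace(' ', '').replace('-', '')
    if not digits.isdigit():
        raise ValueError('card number may only contain digits, spaces, and hyphens')
    return digits


print(normalize_card_number('4111 1111 1111 1111'))
```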
Just think carefully about what you’re doing and what problem you’re trying to solve. Are people using Unicode right-to-left characters to do dumb things to your site, and you want to stop them? No reason to force everyone to use ASCII; Unicode has categories, and you could just filter out characters in the weirder categories. Better yet, just fix your website so people who speak Hebrew can use it. :)
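For instance, the standard unicodedata module can tell you each character's category. The sketch below drops only "Cf" (format) characters, which is where the right-to-left override lives, and leaves ordinary Hebrew letters alone. Which categories to filter, if any, is a judgment call for your site; the function name and the sample strings are made up.

```python
import unicodedata


def drop_format_chars(text):
    """Remove invisible formatting characters (category Cf); keep the rest."""
    return u''.join(ch for ch in text if unicodedata.category(ch) != 'Cf')


tricky = u'evil\u202etxt.exe'          # contains U+202E RIGHT-TO-LEFT OVERRIDE
hebrew = u'\u05e9\u05dc\u05d5\u05dd'   # an ordinary Hebrew word

print(drop_format_chars(tricky))
```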

Debugging

If you’re lucky (i.e., using Pyramid), when your program crashes, you’ll get an interactive debugger that lets you examine the live state of your program. You can run arbitrary Python code, look at the state of variables, walk the stack, and otherwise screw around.
If you’re unlucky, don’t worry; you can still get this by using the werkzeug debugger. It’s pretty simple to use; it wraps any WSGI application and catches exceptions. (See? WSGI is awesome.)
Just make sure you don’t leave debugging on when deploying your app or otherwise sharing it with others; “arbitrary Python code” means anyone seeing the debug screen can do anything to your computer that you can.
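The "wraps any WSGI application and catches exceptions" part is simple enough to sketch. This is not the werkzeug debugger and has none of its interactivity; catch_errors and broken_app are made-up names, and the toy only catches errors raised before the app returns its response body.

```python
import traceback


def catch_errors(app):
    """WSGI middleware: turn an uncaught exception into a plain-text 500 page."""
    def wrapped(environ, start_response):
        try:
            return app(environ, start_response)
        except Exception:
            body = traceback.format_exc().encode('utf-8')
            start_response('500 Internal Server Error',
                           [('Content-Type', 'text/plain; charset=utf-8'),
                            ('Content-Length', str(len(body)))])
            return [body]
    return wrapped


def broken_app(environ, start_response):
    raise ValueError('oh no')


safe_app = catch_errors(broken_app)
```

Showing tracebacks to visitors is exactly the kind of thing you switch off before deploying.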

Databases

What a can of worms. This is as opinionated as I’m going to get.
For one: you should use an ORM. That’s a thingy that tries valiantly to map database tables to Python classes, rows to objects, and queries to method calls. The result is more concise, often easier to understand, and sometimes even more correct.
The ORM you should use is SQLAlchemy. Pyramid has some builtin support for it; if you’re using a framework that doesn’t, SQLAlchemy is popular enough that the framework documentation assuredly has instructions for wiring it in. If you’re using Django, it has its own ORM; it’s not as good as SQLAlchemy, but replacing it is more of a bother than it’s worth unless you have a compelling need.
Many detractors will tell you that ORMs generate bad SQL. Yes, bad ORMs will do this. Good ORMs, like SQLAlchemy, understand SQL as well as you do. If you understand SQL, SQLAlchemy will be great for you; if you don’t understand SQL, SQLAlchemy will at least save you some embarrassment by writing bad SQL in your stead. Remember that you can always look at the queries being run; SQLAlchemy can log them all, and various debug toolbars will show a list with execution times. (Also keep an eye out for the same query being run many times; that’s a sign you want some eagerloading.)
Next, use transactions. You hopefully don’t have to think about this too much; if a framework has any SQLAlchemy integration at all, it probably does this for you. The idea is that a transaction starts when a request starts, and it’s automatically rolled back if there’s an exception. This is behavior you want from the start! It’s half (err, ¼) the point of using a database.
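Here is the transaction-per-request idea in miniature, using the stdlib sqlite3 module in place of SQLAlchemy. The handle_request wrapper and the two views are hypothetical stand-ins for what a framework's integration does for you.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE visits (page TEXT)')
conn.commit()


def handle_request(conn, view):
    """Run a view; commit if it succeeds, roll everything back if it raises."""
    try:
        result = view(conn)
    except Exception:
        conn.rollback()
        raise
    conn.commit()
    return result


def good_view(conn):
    conn.execute('INSERT INTO visits (page) VALUES (?)', ('/home',))
    return 'ok'


def bad_view(conn):
    conn.execute('INSERT INTO visits (page) VALUES (?)', ('/oops',))
    raise RuntimeError('crashed halfway through')


handle_request(conn, good_view)
try:
    handle_request(conn, bad_view)
except RuntimeError:
    pass

# Only the successful request left a row behind.
visit_count = conn.execute('SELECT COUNT(*) FROM visits').fetchone()[0]
print(visit_count)
```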
One more thing: since this article is all about trying new things based on what I say, do not use MySQL. In every sense I can imagine, MySQL is the PHP of databases. Give PostgreSQL a spin; it’s no harder to set up, it’s nicer to use, and it won’t let you do dumb things like store strings in date columns. (One of the nicest things, imo, is that Postgres can use your Unix user account as a login; no passwords required.) The only argument anyone ever has against using Postgres is that it “doesn’t scale”; rest assured I’ve yet to see an actual demonstration of that, and either way you can worry about it when you have a million more visitors.

Sessions

Every framework has session support. It’ll look familiar: a session token is stored in a cookie, and on the backend you magically get a dict that you can store arbitrary junk in. Use this as you will. Try not to use it as a dumping ground; it turns out databases are pretty good for, y’know, storing data.
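Stripped of everything that makes it production-worthy, the mechanism looks something like this. It is a sketch only: load_session and the in-memory _sessions store are invented for illustration, and real session libraries also handle expiry, cookie signing, and persistent storage.

```python
import binascii
import os

_sessions = {}   # token -> dict of arbitrary junk; real backends persist this


def load_session(cookies):
    """Return (token, session dict), creating a fresh session if needed."""
    token = cookies.get('session_id')
    if token not in _sessions:
        # Unknown or missing token: issue a new random one.
        token = binascii.hexlify(os.urandom(16)).decode('ascii')
        _sessions[token] = {}
    return token, _sessions[token]


# First request: no cookie yet, so a session is created.
token, session = load_session({})
session['user'] = 'eevee'

# Later request: the browser sends the token back in its cookie.
same_token, same_session = load_session({'session_id': token})
print(same_session['user'])
```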
Bonus features include first-class support for CSRF protection (built into Pyramid; Django has a module for it) and flash messages (Pyramid, Flask, Django). Go read your docs.
One word of warning: if you’re using Beaker sessions (Pyramid), they tend to accumulate cruft. By default this is in the form of a file on disk for every session ever, but if you use db-backed sessions, you’ll end up with a massive sessions table that never shrinks. This is a terrible and non-obvious problem, and the fixes are all basically manual. Sorry.

Deployment

Ah, you got me. There are a lot of ways to deploy, and they deserve more screen time than I can really devote here.
If possible, be willing to spend money. Providing a service inherently has a cost. It’s easiest by far to deploy apps if you just have your own dedicated (virtual or not) machine to play around with—and a server is a cool thing to have on hand anyway. You can get a basic Linode for $20/mo., and cheaper providers exist (though are less cool).
Heroku is also an option, and does have a free tier of one worker (similar to the lowest-tier Linode), but it’s another $36/mo for every extra worker you add. (The number of requests you can handle simultaneously is roughly proportional to the number of workers you have. How many you need depends on your app and how you run it.) The upside is that deploying your app is pretty much turnkey. Heroku has a few clones by now, as well.
As they say (do they?), deploying is a good problem to have: it means you’ve actually built something useful, after all. So go build something while I scramble to write a whole thing about deployment options.

Python FAQ: Descriptors

How does @property work? Why does it call my __getattr__? What’s a “descriptor”?
Python offers several ways to hook into attribute access—that is, there are several ways you can affect what happens when someone does obj.foo to your object.
The most boring behavior is that the object has a foo attribute (perhaps set in __init__), or the class has a foo method or attribute of its own.
If you need total flexibility, there are the magic methods __getattr__ and __getattribute__, which can return a value depending on the attribute name.
Somewhere between these two extremes lie descriptors. A descriptor handles the attribute lookup for a single attribute, but can otherwise run whatever code it wants.
Properties are very simple descriptors. If you haven’t used them before, they look like this:
class Whatever(object):
    def __init__(self, n):
        self.n = n

    @property
    def twice_n(self):
        return self.n * 2

    @twice_n.setter
    def twice_n(self, new_n):
        self.n = new_n / 2

obj = Whatever(2)
print obj.n  # 2
print obj.twice_n  # 4
obj.twice_n = 10
print obj.n  # 5
This does some stuff to create a descriptor object named twice_n, which jumps in whenever code tries to use the twice_n attribute of a Whatever object. In the case of @property, you can then have things that look like plain attributes but act like methods. But descriptors are a bit more powerful.

How they work

A descriptor is just an object; there’s nothing inherently special about it. Like many powerful Python features, they’re surprisingly simple. To get the descriptor behavior, only three conditions need to be met:
  1. You have a new-style class.
  2. It has some object as a class attribute.
  3. That object’s class has the appropriate special descriptor method.
Note very carefully that these conditions are in terms of classes. In particular, a descriptor will not work if it’s assigned to an object instead of a class, and an object is not a descriptor if you assign the object a function named __get__. Descriptors are all about modifying behavior for classes, not individual objects!
Ahem. So, about those special descriptor methods. There are three of them, and your object can implement whichever ones it needs. Assuming this useless setup:
class OwnerClass(object):
    descriptor = DescriptorClass()

obj = OwnerClass()
You can implement these methods, sometimes called the “descriptor protocol”:
  • __get__(self, instance, owner) hooks into reading, for both an object and the class itself.
    obj.descriptor will call descriptor.__get__(obj, OwnerClass).
    OwnerClass.descriptor will call descriptor.__get__(None, OwnerClass). Here, it’s polite to just return self, so you can still get at the descriptor object like a regular class attribute.
  • __set__(self, instance, value) hooks into writing.
    obj.descriptor = 5 will call descriptor.__set__(obj, 5).
  • __delete__(self, instance) hooks into deletion.
    del obj.descriptor will call descriptor.__delete__(obj).
    Note this is not the same as __del__; that’s something different entirely.
A minor point of confusion here: the descriptor is triggered by touching attributes on obj, but inside these methods, self is the descriptor object itself, not obj.
You can implement any combination of these you like, and whichever you implement will be triggered. This may or may not be what you want, e.g.: if you only implement __set__, you won’t get a write-only attribute; obj.descriptor will act as normal and produce your descriptor object.
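If you'd like to watch the protocol fire, here's a throwaway descriptor that just records how it was called. Recorder, SetOnly, Vault, and Plain are names made up for this sketch; the last two bits illustrate the __set__-only case and the "class attributes only" rule.

```python
class Recorder(object):
    """Descriptor that records how each protocol method was called."""

    def __init__(self):
        self.calls = []

    def __get__(self, instance, owner):
        if instance is None:
            return self   # accessed on the class: hand back the descriptor
        self.calls.append(('get', instance, owner))
        return 42

    def __set__(self, instance, value):
        self.calls.append(('set', instance, value))

    def __delete__(self, instance):
        self.calls.append(('delete', instance))


class OwnerClass(object):
    descriptor = Recorder()


obj = OwnerClass()
value = obj.descriptor      # calls __get__(obj, OwnerClass)
obj.descriptor = 5          # calls __set__(obj, 5)
del obj.descriptor          # calls __delete__(obj)
recorder = OwnerClass.descriptor


class SetOnly(object):
    """Only hooks writing; reading is left alone."""

    def __set__(self, instance, value):
        instance.__dict__['_secret'] = value


class Vault(object):
    secret = SetOnly()


vault = Vault()
vault.secret = 'hunter2'
read_back = vault.secret    # not write-only: this is the SetOnly object itself


class Plain(object):
    pass


plain = Plain()
plain.attr = Recorder()     # stored on an instance, so no descriptor magic
on_instance = plain.attr    # just the Recorder object; __get__ never runs
```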

Writing a descriptor

Talking about descriptors involves juggling several classes and instances. Let’s try a simple example, instead: recreating property.
First, the read-only behavior.
class prop(object):
    def __init__(self, get_func):
        self.get_func = get_func

    def __get__(self, instance, owner):
        if instance is None:
            return self

        return self.get_func(instance)

class Demo(object):
    @prop
    def attribute(self):
        return 133

print Demo().attribute
This code sneaks the descriptor in using a decorator. Remember that decorators can be rewritten as regular function calls. The class definition is roughly equivalent to this:
def getter(self):
    return 133

class Demo(object):
    attribute = prop(getter)
So the descriptor, attribute, is just an object wrapping a single function. When code reads from Demo().attribute, the descriptor calls its stored function on the Demo instance and passes along the return value.
(The instance has to be passed in manually because the function isn’t being called as a method. If you refer to them within a class body directly, methods are just regular functions; they only get method magic added to them at the end of the class block. It’s complicated.)
With this implementation, code could still do obj.attribute = 3 and the descriptor would be shadowed. Want setter behavior, too? No problem; add a __set__.
class prop(object):
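    # __init__ and __get__ are the same as before.
    def __init__(self, get_func):
        self.get_func = get_func

    def __get__(self, instance, owner):
        if instance is None:
            return self

        return self.get_func(instance)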

    def __set__(self, instance, value):
        self.set_func(instance, value)

    def setter(self, set_func):
        self.set_func = set_func
        return self

    def set_func(self, instance, value):
        raise TypeError("can't set me")

class Demo(object):
    _value = None

    @prop
    def readwrite(self):
        return self._value

    @readwrite.setter
    def readwrite(self, value):
        self._value = value

    @prop
    def readonly(self):
        return 133

obj = Demo()
print obj.readwrite
obj.readwrite = 'foo'
print obj.readwrite
print obj.readonly
obj.readonly = 'bar'  # TypeError!
Look at all this crazy stuff going on. Take it a step at a time.
The new __set__ method works much like __get__: it calls a stored function on the given instance.
The setter method makes the @readwrite.setter decoration work. It stores the function, and then returns itself. Remember, it’s a decorator, so whatever it returns will end up assigned to the decorated function’s name, readwrite. The class definition is equivalent to:
def func1(self):
    return self._value

readwrite = prop(func1)

def func2(self, value):
    self._value = value

readwrite = readwrite.setter(func2)
Don’t be fooled: it looks like there are two readwrite functions, but the class ends up with a single object that happens to contain two functions.
I include a default setter function, set_func, so that properties are read-only unless the class specifies otherwise. It’s got three arguments because it’s a regular method: calling it with (instance, value) will tack the descriptor object on as the first argument.
This is most of the way to an exact clone of Python’s builtin property type, and it’s only a handful of very short methods.

Potential uses

Properties are an obvious use, but they’re built in, so why would you care about descriptors otherwise?
Maybe you wouldn’t. It’s metaprogramming, after all, so you either know you need it or can’t imagine why you ever would. I’ve used them a couple times, though, and I’ve seen them in the wild enough. Some examples:
  • Pyramid includes a nifty decorator-descriptor, @reify. It acts like @property, except that the function is only ever called once; after that, the value is cached as a regular attribute. This gives you lazy attribute creation on objects that are meant to be immutable. It’s handy enough that I’ve wished it were in the standard library more than once.
  • SQLAlchemy’s ORM classes rely heavily on descriptors: SomeTableClass.column == 3 is actually using a descriptor that overloads a bunch of operators.
  • If you’re writing a class with a lot of properties that all do similar work, you can write your own descriptor class to factor out the logic, rather than writing a bunch of similar property functions that all call more methods.
  • If you find yourself writing a __getattr__ with a huge stack of ifs or attribute name parsing or similar, consider writing a descriptor instead.
  • Ever wonder how, exactly, self gets passed to a method call? Well, methods are just these class attributes that do something special when accessed via an object… surprise, methods are descriptors!
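Two of those bullets fit in a few lines each. The first class below is a caching descriptor in the spirit of @reify; it is a sketch of the idea, not Pyramid's implementation, and cached, Report, and Greeter are made-up names. The second half pokes at a plain function's own __get__ to show where bound methods come from.

```python
class cached(object):
    """Like a read-only property, but the function only ever runs once."""

    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner):
        if instance is None:
            return self
        value = self.func(instance)
        # There is no __set__ here, so an instance attribute with the same
        # name wins on every later lookup and this method is never called again.
        instance.__dict__[self.func.__name__] = value
        return value


class Report(object):
    computed = 0   # counts how many times the expensive function ran

    @cached
    def total(self):
        Report.computed += 1
        return sum(range(1000))


report = Report()
first = report.total
second = report.total


class Greeter(object):
    def greet(self):
        return 'hello'


greeter = Greeter()
plain_function = Greeter.__dict__['greet']                # a regular function
bound_method = plain_function.__get__(greeter, Greeter)   # what greeter.greet does
print(bound_method())
```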

Descriptors and AttributeError

One final gotcha. A __get__ method is allowed to raise an AttributeError if it wants to express that the attribute doesn’t exist. Python will then fall back to __getattr__ as usual.
Consider this, then:
def __get__(self, instance, owner):
    log.debg("i'm in a descriptor!")
    # do stuff...
log.debg probably doesn’t exist, so that code will raise an AttributeError… which Python will take to mean the descriptor is saying it doesn’t exist. This is probably not what you want. Be very careful with attribute access inside a descriptor, especially for classes that also implement __getattr__.
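Here is that gotcha reproduced end to end with the standard logging module. Chatty and Host are made-up names; the misspelled debg is deliberate.

```python
import logging

log = logging.getLogger(__name__)


class Chatty(object):
    def __get__(self, instance, owner):
        if instance is None:
            return self
        log.debg("i'm in a descriptor!")   # typo: loggers have no 'debg'
        return 'real value'


class Host(object):
    attr = Chatty()

    def __getattr__(self, name):
        return 'fallback for %r' % name


# The AttributeError from the typo is swallowed, and __getattr__ answers instead.
result = Host().attr
print(result)
```

No traceback, no hint that anything went wrong, which is what makes this one so annoying to track down.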

Conclusion

  • property is cool.
  • Descriptors are cool.
  • They aren’t hard to write, if you can keep self and instance straight.
  • They only work as class attributes!

Python FAQ: Passing

How do I pass by reference? Does Python pass by reference or pass by value?
This question is most often asked by C++ immigrants, who are used to a firm distinction between these kinds of passing and a bunch of subtle pros/cons for each.
So, then, does Python pass by reference or value?
Short answer: objects are passed as if by reference, not copied. If you change an object in a function, it’ll change in the caller. But! You can’t assign to an argument name and magically have values in the caller change.
Long answer: both, and neither. Hmm. This may require some context.

References and values

In C and C++, variable declarations are really memory declarations. Consider this innocuous statement:
int x;
This doesn’t really create a thing named x. What it does is ensure that, at runtime, there will be some chunk of memory somewhere big enough to hold an integer, and whenever your code says x, it will look in that same chunk. For all you care, that block might be in RAM or swap or hibernated or on the moon somewhere. If you use register, it won’t be system memory at all. All x refers to here is a wink and a nod between you and your compiler, agreeing that whenever you say x, you mean the same place as every other time you say x.
Enter function calls.
void do_the_needful(some_bigass_struct foo) {
    /* ... */
}
some_bigass_struct foo is still a variable declaration. At runtime, you’ll have a chunk of memory the size of that struct, and anytime you say foo inside this function, you’re guaranteed to be talking about the same chunk of memory.
Because of this, anything used as a function argument is copied. When this function is called, foo contains a byte-for-byte copy of whatever struct was actually used as an argument. This is pass-by-value: the function receives an equivalent value, but it has a different identity (or memory location, if you must).
Clearly this isn’t going to work so well for nontrivial types. You waste a lot of time copying this whole struct, and then your function can’t even change anything and have it reflected in the caller’s struct, because you only have a copy.
The C way to fix this is to pass a pointer, instead. That’s still technically passing by value, but the “value” here is a memory address. That’s only a few bytes. And even though the pointer’s identity is different, it still points to the same single struct, so a function can muck about with the struct contents if it so pleases.
Along comes C++. C++ decided that pointers were confusing, because universities were inexplicably trying to teach pointers to CS102 students who barely understood what a compiler was for, and the students weren’t getting it. Well, gosh, let’s fix this by getting rid of pointers.
C++’s solution to the pass-a-bunch-of-stuff problem was to introduce references.
void do_the_needful(some_bigass_struct &foo) {
    // whoa, inline comment
}
Now you can call do_the_needful(bar) without fear. It still looks like the entire struct is being passed in, but the & reference sigil causes foo to be an alias for bar. In other words, foo no longer reserves some runtime chunk of memory; it becomes another way to talk about the same chunk of memory the caller has, somewhere. And because foo is bar, you can even assign to foo and overwrite bar outright—in C, you’d generally use a double pointer to do that without copying.
This is pass-by-reference: the same chunk of memory is now shared by two different variables, a feat that is impossible in C.

Back to Python

With these (hopefully-clear) definitions, let us consider Python again.
def do_the_needful(foo):
    pass

obj = SomeBigassClass()
do_the_needful(obj)
So, is foo passed by value or reference?
Again, the short answer is “neither”. But the real answer is that the question doesn’t make sense for Python! Variable names aren’t fixed preallocated chunks like they are in C or C++. Python variable names are just that: names.
Compare:
  • In C, int x = 3; declares a memory chunk named x and writes the value 3 into it.
  • In Python, x = 3 creates a value 3 and makes x a name for it. All values are objects and thus first-class entities; they can exist with several names or no name at all.
If it helps: C variables are boxes that you write values into. Python names are tags that you put on values. This is a cool illustration.
And much like in C, argument passing is just a funny way of doing assignment. The foo argument in this function might as well have been assigned to with foo = obj; the effect would be the same.
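You can check the "it's just assignment" claim with the is operator, which compares identity rather than equality. The class here is an empty stand-in.

```python
class SomeBigassClass(object):
    pass


def do_the_needful(foo):
    # Hand the argument straight back so the caller can inspect it.
    return foo


obj = SomeBigassClass()
received = do_the_needful(obj)   # what the function saw as foo
alias = obj                      # plain assignment, for comparison

print(received is obj, alias is obj)
```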
It’s not pass-by-value, then, because there’s no copying done, and the function still has the same object as the caller. (Python never copies anything implicitly.) Is it pass-by-reference? This sure sounds like C++ references so far.
def increment(n):
    n = n + 1

i = 1
print i
increment(i)
print i
Nope; this will just print 1 twice. Inside the function, assigning to n doesn’t do anything to the value n refers to; it just makes the name, n, refer to something else now. So n will be 2, sure, but then the function ends and n goes away and i is left unchanged because you never did anything to i.
This is different from changing an existing value:
def lengthen(n):
    n.append(2)

i = [1]
print i  # [1]
lengthen(i)
print i  # [1, 2]
In this case, n was never reassigned; instead, a method call altered the value directly. It’s still the same list, and both n and i refer to it, but the list’s contents changed.
Got it? Good, because there’s one more wrinkle: operator overloading does weird things here. You could rewrite both of these functions using +=, for example. In increment, i wouldn’t change, but in lengthen, it would! This is because ints (and strs, tuples, and some other types) are immutable, so they implement += literally: by creating a new object and assigning it. But lists are mutable, so as a convenience shortcut, += acts like .extend() and changes the list in-place. This quirk has nothing to do with passing, though; these types just overloaded += differently.
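The += versions of the two functions, for the skeptical:

```python
def increment(n):
    n += 1      # ints are immutable: this rebinds the local name n only


def lengthen(n):
    n += [2]    # lists are mutable: this extends the caller's list in place


i = 1
increment(i)

items = [1]
lengthen(items)

print(i, items)
```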
Anyway, um, this is definitely not pass-by-reference either.
If anything, Python is a third option: pass-by-object.

What to do instead

So, wait, what if you do want to write something like increment?

Return stuff.

Much of the use of pointer/reference arguments in C and C++ is for “out parameters”: the function returns some status value, and its actual results are “returned” by modifying particular arguments.
But this ain’t C, so why would we do that? You can just return multiple values.
def foo():
    return True, "abc"

status, value = foo()
Or, you know, just raise exceptions on failure. Then the caller doesn’t get a nasty surprise when he forgets to check your status code.
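A small sketch of the exception route. `parse_port` is a hypothetical function invented for this example; the only real API it leans on is the built-in `int()`, which raises ValueError on bad input.

```python
def parse_port(text):
    """Return the port number in text, or raise ValueError."""
    port = int(text)  # int() itself raises ValueError on junk input
    if not 0 < port < 65536:
        raise ValueError("port out of range: %d" % port)
    return port

try:
    port = parse_port("80800")
except ValueError as exc:
    port = None
    message = str(exc)

print(port)     # None
print(message)  # port out of range: 80800
```

There is no status code to forget: a caller who skips the try block gets a traceback instead of silently carrying on with a bad value.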

Use methods.

If you really need to mutate the caller’s values, you might want to use an object to store those values, and turn the function into a method. Methods can mess with the invocant’s attributes all they want, and this keeps the mess nicely contained.
class Incrementer(object):
    def __init__(self, count):
        self.count = count

    def increment(self):
        self.count += 1

i = Incrementer(1)
print(i.count)  # 1
i.increment()
print(i.count)  # 2

Use a mutable object.

As a last resort, you can always put the values into a list (or dict, object, etc.), pass that to the function, have the function mutate the list, then extract the new values on the outside.
That’s gross, though. Don’t do that.
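For completeness, this is roughly what the last resort looks like, assuming a one-element list used as a box. The `box` name is made up for illustration.

```python
def increment(box):
    box[0] += 1   # mutate the shared list; nothing is returned

i = 1
box = [i]         # wrap the value...
increment(box)
i = box[0]        # ...and unwrap it again afterwards
print(i)          # 2
```

Three lines of ceremony around one function call, which is most of why returning the new value is the better habit.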

Under the hood

If you must know! In CPython, every Python value is actually a PyObject*. So argument passing, assignment, etc. actually act fairly similarly to C, if you wrote C where absolutely everything were a pointer (and there were no double pointers for cheating).
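#include <stdio.h>

void increment(int *n) {
    int newval = *n + 1;
    n = &newval;  /* rebinds the local pointer only; the caller's i is untouched */
}

int main(void) {
    int i = 1;
    increment(&i);
    printf("%d\n", i);  /* prints 1 */
    return 0;
}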
This is the spiritual equivalent to the Python function above. (Please ignore the impending segfault.) Assigning to n naturally does nothing, because only the pointed-to value is shared. But if that value were something mutable like a list, you could change it in-place.
And this is why “both” is a correct answer as well: you could say that Python is pass-by-value, where the values are pointers… or you could say Python is pass-by-reference, where the references are copies. Or you could say it’s “pass-by-pointer”. But now you’re thinking too hard about it.
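If you want to watch the pointers go by, here is a sketch using the built-in `id()`, which in CPython happens to return the object's memory address (an implementation detail, not a language guarantee). The `show` function is made up for illustration.

```python
def show(n):
    inside = id(n)  # identity of the object the argument names
    n = n + [2]     # rebinding: n now names a brand-new list
    return inside, id(n)

i = [1]
before, after = show(i)

print(before == id(i))  # True: the function received the very same object
print(after == id(i))   # False: the rebinding never touched the caller
print(i)                # [1]
```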

Conclusion

  • Python functions can’t replace what names in the caller refer to.
  • Reassigning an argument name won’t do anything useful.
  • Python functions can mutate their arguments, if the arguments are mutable.
  • Nothing is implicitly copied in Python.
  • Stop comparing Python so closely to C++ and you’ll have a much better time.

Android Community Weekly: August 12th, 2012


With the past week finally winding down, one of the best ways to enjoy your Sunday afternoon is to catch up on all the latest Android news. Here at Android Community we had another exciting week, with the OUYA gaming console dropping all sorts of news and T-Mobile launching three new smartphones. Head on below for your weekly fix!


The week started off with exciting news for Motorola DROID 4 fans, who finally received Android 4.0 Ice Cream Sandwich from Verizon. Then Motorola teased a new smartphone on Facebook that ended up being nothing but the already released Atrix HD on AT&T. The Nexus 7 was in the news again, as it has been for a few months straight. This time it was mounted inside a Dodge RAM for some in-dash entertainment, and was amazing.

That isn’t all the N7 did either. As you can see above it was overclocked to 1.7 GHz on that quad-core processor and blew away the benchmarks. Thanks to the Trinity kernel it has far exceeded the limits we thought it was capable of. Google’s Google Now search arrived in small pieces for the iPhone, and the rest of the non Jelly Bean Android users are hoping they get the same treatment too. Then as we mentioned above T-Mobile launched two new smartphones on the 8th. One being their version of the original Samsung Galaxy Note, then their budget friendly myTouch smartphones.
The OUYA game console added a list of games and even XBMC support, as well as multiple announcements you can see in the timeline below. To round off the week HTC announced the ThunderBolt would see Ice Cream Sandwich before the end of August, and same goes for their Desire S. Last but certainly not least we are giving away some brand new Nexus 7 tablets. Yes you read that right! Android Community and NVIDIA have partnered again and are giving away 3 brand new 16GB Nexus 7 tablets. Full details can be found in our Nexus 7 Giveaway post. Hit the links below for more coverage from this week, and enjoy the rest of the weekend.
