Jay Taylor's notes


The Collapse of the UNIX Philosophy

Original source (kukuruku.co)
Tags: history unix kukuruku.co
Clipped on: 2017-03-02

There’s just one thing I’d like to say: we still don’t know how to work conveniently with strings in C. The inconvenience of working with strings constantly leads to security issues. This problem is still not solved! There’s a relatively recent document from the C standards committee that discusses a rather questionable way of solving the string problem, and its conclusion is that the method is bad. Year of publication: 2015. In other words, there was still no final solution even by 2015!

Say nothing of the lack of: a simple, user-friendly, multiplatform build system (not the autotools monster that does not support Windows, and not the cmake monster that supports Windows but is still a monster); a standard package manager, user-friendly like npm (JS) or cargo (Rust); a portability library with whose help one could at least read the contents of a folder the same way on all platforms; and a main website for C that would be the entry point for all beginners and would contain not only documentation but also a brief manual on installing C tools on any platform, a manual on creating a simple project in C, a user-friendly list of C packages (which should live in a standard repository), and, most importantly, a gathering place for the user community. I’ve even registered the c-language.org domain hoping to create such a website there. Yeah, dream on! (I even have cpp-language.org, bwahaha!) C and C++ have none of this, even though all other popular languages have it. Even Haskell has it! And Rust!

In Rust, this jackanapes aimed at the same niche as C, there’s a single config file that serves as the project config, the build config, and the package manager config all at once (in fact, cargo is a package manager and a build system at the same time). As a dependency of a package, you can specify another package located somewhere in GIT, including specifying a GITHUB repository as a direct dependency. There’s out-of-the-box support for generating documentation from MARKDOWN comments in source code. A package manager that uses SEMVER for versions. So: GIT, GITHUB, MARKDOWN, SEMVER. In other words, BUZZWORDS, BUZZWORDS and HIPSTERS’ BUZZWORDS. All of this out of the box. You just go to their main website, and there it is on a silver platter. All of it works the same way on all platforms. And this despite the fact that Rust is a systems programming language, not just some JavaScript. Despite the fact that you can play with bytes in Rust. It even has pointer arithmetic. So why do Rusters have all of these hipster buzzwords, and we C guys don’t? What a shame.
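For illustration, here is a minimal Cargo.toml sketch (the package names and the URL are made up); one file really does serve as project config, build config, and dependency config at once:

```toml
[package]
name = "hello"
version = "0.1.0"
edition = "2021"

[dependencies]
# a semver requirement, resolved from the registry:
serde = "1.0"
# a GitHub repository as a direct dependency:
foo = { git = "https://github.com/example/foo" }
```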

I remember how one of my friends asked me where to look for a list of packages for C/C++. I had to tell him that there’s no such place. He asked me whether C/C++ programmers must simply suffer. I had nothing to tell him.
Oh, right, I forgot one more thing. Take a look at the prototype of the signal function in the form it appears in the C standard:

void (*signal(int sig, void (*func)(int)))(int);

Try to understand it.

  • Terminals in UNIX — weird legacy. The details are here.

  • Filenames in Unix file systems (ext2 and others) are simply streams of bytes with no encoding. Which encoding they are interpreted in depends on the locale. So if we create a file in one locale and then look at its name in another locale, nothing good will come of it. There’s no such problem in Windows NTFS.

  • UNIX shell is worse than PHP! Yes, it is. Didn’t you know? It’s popular nowadays to criticize PHP, but UNIX shell is even worse. It becomes especially bad when we try to develop in it, as it’s not a full-fledged programming language. But it’s no good even in its own niche, scripting common administrative tasks. The reasons are the shell’s primitivity, its generally inefficient design, legacy, tons of special cases, dirty hacks, a complete mess with quotation marks, backslashes and special characters, and the shell’s obsession (just like the entire UNIX) with plain text.

    • Let’s begin with a teaser. How can we recursively find all files named \\ (two backslashes) in a folder foo? The correct answer is: find foo -name '\\\\'. We can also do it like this: find foo -name \\\\\\\\. The latter way will cause lots of questions. Try to explain to a person who is not fluent in UNIX shell why exactly four backslashes are necessary in the quoted version, not two or eight. We need four backslashes because find expands backslashes in its patterns, and in the unquoted version the UNIX shell expands them once more on top of that.
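Both layers of unescaping can be watched in action. A self-contained sketch with a file whose name is two backslashes:

```shell
mkdir -p foo
touch 'foo/\\'            # a file whose name is two backslashes (quotes stop the shell)

find foo -name '\\\\'     # quoted: find itself unescapes \\\\ into \\
find foo -name \\\\\\\\   # unquoted: the shell first halves eight into four,
                          # then find unescapes four into two
```

Both commands print foo/\\.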
    • How do we touch all files in foo (and its subfolders)? At first glance, we could do it like this: find foo | while read A; do touch $A; done. Well, that’s at first glance. In fact, there are 5 things that can ruin it all (and lead to security problems):
      • Filename can contain a backslash. Therefore, we should write read -r A instead of read A
      • Filename can contain spaces (or glob characters), which is why we should write touch "$A" instead of touch $A
      • Filename can not only contain a space but also start with one, so we need to write IFS="" read -r A instead of read -r A
      • Filename can contain a newline, so we should use find foo -print0, and instead of IFS="" read -r A use IFS="" read -rd "" A (I’m not really sure here)
      • Filename can start with a hyphen, so we need to write touch -- "$A" instead of touch "$A". The final version looks like this:

        find foo -print0 | while IFS="" read -rd "" A; do touch -- "$A"; done

        Cool, isn’t it? By the way, we didn’t take into account that POSIX does not guarantee that touch supports the -- option. Considering this, we would have to check whether each filename starts with a hyphen (or whether it does not start with a slash) and prepend ./ to it. Do you understand now why the configure scripts generated by autoconf are so large and hard to read? Because configure has to take all of this crap into account, including compatibility with various shells. In this example, I used the solution with a pipe and a loop. I could also have used a solution with exec or xargs, but it wouldn’t be as eye-catching. (Well, okay: we know that each filename starts with foo, so it cannot start with a space or a hyphen.)
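The exec solution mentioned above is indeed shorter and dodges every one of the five traps, because find hands the raw filenames straight to touch without any shell in between. A sketch (the sample filenames are made up):

```shell
mkdir -p foo/sub
touch 'foo/ leading space' 'foo/sub/-dash'

# "{} +" batches many filenames into one touch invocation (POSIX find);
# touch -- is still not guaranteed by POSIX, as noted above
find foo -type f -exec touch -- {} +
```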
    • Let’s say we need to delete a file on host a@a; the name of the file is in a variable A. How can we do it? Perhaps like this: ssh a@a rm -- "$A"? (As you may have noticed, we have already taken into account that the filename can contain spaces and start with a hyphen.) Never ever do this! ssh is not like chroot, setsid, nohup, sudo or any other command that receives an exec-command (that is, a command passed directly to the execve family of system calls). ssh (just like su) receives a shell-command, i.e. a command to be processed by a shell (the terms exec-command and shell-command are my own). ssh combines all its arguments into one string, passes that string to the remote side and executes it with the shell there. Okay, maybe like this: ssh a@a 'rm -- "$A"'? No, this command tries to find variable A on the remote side, and it’s not there, as variables cannot be passed via ssh. Well, maybe like this: ssh a@a "rm -- '$A'"? Nope, this won’t work if the filename contains a single quote. Anyway, the correct answer is:

      ssh a@a "rm -- $(printf '%q\n' "$A")"

      Convenient, don’t you think?
    • How do we get to host a@a, from it to b@b, then to c@c, then to d@d, and delete the file /foo there? Well, this one is simple:

      ssh a@a "ssh b@b \"ssh c@c \\\"ssh d@d \\\\\\\"rm /foo\\\\\\\"\\\"\""

      Too many backslashes, huh? If you don’t like it, we can alternate single and double quotation marks:

      ssh a@a 'ssh b@b "ssh c@c '\''ssh d@d \"rm /foo\"'\''"'

      By the way, if we used Lisp instead of shell, and the ssh function passed not a string but a parsed AST (abstract syntax tree) to the remote side, there wouldn’t be so many backslashes:

      (ssh "a@a" '(ssh "b@b" '(ssh "c@c" '(ssh "d@d" '(rm "foo")))))

      “Huh? What? Lisp? What Lisp?” Curious, aren’t you? Go read here. You can also refer to other articles by Paul Graham.
    • Let’s combine the previous two paragraphs. The name of a file is in a variable A. We need to go to a@a, then to b@b, then to c@c, then to d@d, and delete the file whose name is in A. I’m going to leave this one to you as an exercise. (I don’t know how to do it. :) Well, I might figure it out if I thought about it.)
    • echo is supposedly designed for displaying strings on the screen. But the thing is, we can’t use it for this purpose if the string is even slightly more complex than “Hello, world!”. The only reliable way to print an arbitrary string (e.g. from a variable A) is this: printf '%s\n' "$A".
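A quick way to see the problem (the value here is made up): echo chokes on a string that merely looks like one of its options, while printf passes it through untouched:

```shell
A="-n"              # a perfectly legal string value
echo "$A"           # many shells print nothing: -n is taken as the "no newline" option
printf '%s\n' "$A"  # prints: -n
```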
    • Suppose you want to redirect both stdout and stderr of a command cmd to /dev/null. The riddle: which of these six commands do the job?

      cmd > /dev/null 2>&1
      cmd 2>&1 > /dev/null
      { cmd > /dev/null; } 2>&1
      { cmd 2>&1; } > /dev/null
      ( cmd > /dev/null ) 2>&1
      ( cmd 2>&1 ) > /dev/null

      It turns out the correct answer is: the 1st, the 4th and the 6th, while the 2nd, the 3rd and the 5th don’t work. And again, I’m leaving the reason to you as an exercise. :)
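The trick is that redirections apply left to right, against the file descriptors as they are at that moment. A minimal sketch with a made-up noisy command that writes to both streams:

```shell
noisy() { echo out; echo err >&2; }

noisy > /dev/null 2>&1   # 1st form: stdout -> /dev/null, then stderr -> (new) stdout; silent
noisy 2>&1 > /dev/null   # 2nd form: stderr -> where stdout points NOW (the terminal!),
                         # and only then stdout -> /dev/null; "err" still shows up
```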
  • Actually, this post appeared as a response to this one. It says that a special date is used in Windows as a driver timestamp, instead of introducing a special attribute or checking the manufacturer. There are lots of similar things in UNIX. A file is hidden based solely on a dot at the beginning of its name, instead of a special attribute. I was shocked when I first learned about it (yeah, back in the old days when I installed Ubuntu for the first time). “What idiots!”, I thought. I’m used to it now, but thinking about it, it’s a terrible workaround. Then, the shell decides whether it is a login shell based on a hyphen as the first character of argv[0]. This is a misuse of argv[0]; argv[0] is not meant for that purpose. Any other method would be better, e.g. a separate argument or some environment variable.

  • In BSD sockets, the user must convert the byte order of the port number on their own. And all because someone made a mistake in the UNIX kernel code, failing to foresee the byte order conversion. As a temporary hack, this someone fixed user space instead of the kernel code. That’s how it goes, and that’s how it made its way into Windows (together with the file /etc/hosts, aka C:\windows\system32\drivers\etc\hosts). The Source.

  • UNIX Philosophy

    Some people think that UNIX is great and perfect, and that all its basic ideas («everything is a file», «everything is text» and so on) are amazing and form the so-called “UNIX Philosophy”. I guess you’re starting to understand that it’s not quite so. Let’s review this “Unix philosophy”; have a look at the points below. I’m not trying to say that all of these things should be abolished, I’m simply pointing out some drawbacks.

    • “Everything is text”. As we’ve already seen in the example with /etc/passwd, the widespread use of plain text can lead to performance problems. UNIX authors effectively invented a format for each system config (passwd, fstab, etc.), each with its own rules for escaping special characters. Surprised? /etc/fstab uses spaces and line breaks as separators. But what if folder names contain, say, spaces? For this case, the fstab format provides special escapes for folder names. It turns out that any script reading fstab should be able to interpret those escapes; there is even the fstab-decode utility meant for this purpose (run as root). You didn’t know that, did you? Go fix your scripts. :) As a result, we need a parser for each system config. It would be much easier if we used JSON or XML for system configs. Or maybe some binary format, especially for those configs that are constantly read by different programs and therefore need to be read quickly (binary formats parse faster).
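To make the escaping concrete (the device and mount point here are made up): fstab encodes a space in a mount point as the octal escape \040, and every fstab reader must decode it itself. POSIX printf %b understands the same \0ddd octal escapes:

```shell
# a hypothetical fstab line for a mount point containing a space:
#   /dev/sdb1  /mnt/my\040disk  ext4  defaults  0  2

printf '%b\n' '/mnt/my\040disk'     # prints: /mnt/my disk
```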

    That’s not all I wanted to say about “everything is text”. Standard utilities produce their output as plain text, so for each utility we effectively need a parser of its own: we often have to pick an output apart with sed, grep, awk, etc. Each utility has its own options to determine which columns to display, which columns to sort by, and so on. It would be better if utilities produced their output as XML, JSON, some binary format or anything else structured. To display this information in a user-friendly way and to work with it further, we could pipe the result to additional utilities that remove some columns, sort by some column, select the required rows, etc., and either display the result as a nice table or pass it somewhere else. All of this would be done in a generic way that does not depend on the utility that generated the output, with no need to parse anything with regexes. UNIX shell isn’t good at working with JSON and XML, but UNIX shell has plenty of other drawbacks anyway. We’d better throw it away and use some other language that works well with things like JSON and can do lots of other things besides.

    Just imagine! Let’s say we need to delete all files in the current folder bigger than 1 kilobyte. Yes, I know that we can do this with find. But let’s suppose we absolutely must do it via ls (and without xargs). How? Like this:

    LC_ALL=C ls -l | while read -r MODE LINKS USER GROUP SIZE M D Y FILE; do if [ "$SIZE" -gt 1024 ]; then rm -- "$FILE"; fi; done
    

    We need LC_ALL here to make sure the date takes exactly three words in the output of ls. This solution not only looks ugly but also has a number of drawbacks. First, it will not work if a filename contains a line break or begins with a space. Second, we have to explicitly list the names of all ls columns, or at least remember where the ones we need (i.e. SIZE and FILE) are located. If we get the column order wrong, the error will only become apparent at runtime, when we delete the wrong files. :)

    What would the solution look like in the perfect world I’m suggesting? Something like this: ls | grep 'size > 1kb' | rm. It’s short and, most importantly, you can see the meaning in the code, and it’s impossible to make a mistake. Let’s see. In my world, ls always outputs all the information; we don’t need a special -l option for that. When we need to drop all columns and keep only the filename, we do it with a special utility that we pipe the ls output to. So, ls produces a list of files in some structured form, say, JSON. This representation “knows” the names of the columns and their types, i.e. whether a value is a string, a number or something else. Then this output is piped to grep which, in my world, selects the necessary records from JSON. JSON “knows” the field names, so grep “understands” what “size” means here. Moreover, JSON carries the type of the size field: that it’s a number, and not just a number but a file size, so we can compare it to 1kb. Next, grep pipes its output to rm. rm “sees” that it’s receiving files (yes, JSON also stores the information that these strings are files) and deletes them. JSON is also responsible for correctly escaping special characters, which is why files with special characters “just work”. Cool, right? I took the idea from here. It should also be mentioned that something of the kind is implemented in Windows PowerShell.

    • UNIX shell. Another basic idea of UNIX. I’ve already mentioned some drawbacks of UNIX shell in the first part of the article. What’s “cool” about UNIX shell? At the moment of its release (a long time ago), it was much more powerful than the command interpreters embedded in other operating systems. It allowed writing more powerful scripts. It seems the UNIX shell was the most powerful scripting language of its day, because there were no sane scripting languages back then (meaning ones that would allow full-fledged programming and not just scripting). It was later that a programmer named Larry Wall noticed that UNIX shell lacked a lot to be considered a good programming language. He decided to combine the simplicity of UNIX shell with the possibility of full-fledged programming from C, and created Perl. Yes, Perl and the scripting languages that followed it effectively replaced UNIX shell. Even Rob Pike, one of the authors (to my mind) of the “UNIX philosophy”, confirms this. Answering a question about “one tool for one job” here, he said: “Those days are dead and gone and the eulogy was delivered by Perl”. Actually, I believe this phrase refers to the typical use of UNIX shell, i.e. to combining a large number of small tools in a shell script. No, says Pike, just use Perl.

      I’m not done talking about UNIX shell. Let’s go back to the example of shell code I gave above:

        find foo -print0 | while IFS="" read -rd "" A; do touch -- "$A"; done

      We call touch in a loop here (yes, I know the code can be rewritten with xargs so that touch is called only once; let’s forget about that for now, okay?). We call touch in a loop! That means a new process for every file. This is extremely inefficient; code in almost any other programming language will work faster. But when UNIX shell appeared on the scene, it was one of the few languages that let you write this action in one line.
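For comparison, the xargs rewrite alluded to above (with the same caveat that POSIX guarantees neither touch -- nor -print0/-0, which are GNU and BSD extensions):

```shell
mkdir -p foo
touch 'foo/a b' foo/c

# filenames are separated by NUL, so any character except NUL is safe,
# and xargs runs touch once per large batch instead of once per file:
find foo -type f -print0 | xargs -0 touch --
```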

      Long story short, we should use some other scripting language instead of UNIX shell: a language suitable not only for scripting but for real programming as well, one that does not spawn a new process every time we need to touch a file. Perhaps we’ll have to borrow some features from shell to simplify things even more.
    • Simplicity. I’m not talking about shell and combining lots of simple utilities in shell (that was the previous point). I’m talking about simplicity in general: using simple tools. Say, editing a picture with sed. Yes, yes: converting a jpg into ppm with the command line, then editing the picture with a text editor, grep, and sed, and then converting it back into jpg. Yes, we can do this. However, it’s often better to do it with Photoshop or GIMP, even though they’re large, integrated programs, not in the UNIX style at all.

    I guess I’ll stop adding points now. Yep, that’s enough. There are some ideas in UNIX I really like, such as this one: “a program should do one thing, and do it well”. But not within the context of shell. I guess you’ve realized by now that I don’t like shell. (Once again, I think that in the interview quoted above Rob Pike took the principle “a program should do one thing, and do it well” in the context of shell, and therefore rejected it.) I’m talking about this principle in its essence. For example, a console mail client should not have a built-in text editor; it should just run some external editor. Or the principle by which one should write a program’s console core before its graphical user interface.

    Now, the general picture. Once there was UNIX. It was a breakthrough at the time and better than its competitors in lots of ways. UNIX embodied a number of ideas. Like any other operating system, UNIX required programmers to comply with certain principles when writing application programs. The fundamental ideas got the name of the “UNIX philosophy”. One of the people who formulated the UNIX philosophy was the already mentioned Rob Pike, in his presentation titled “UNIX Style, or cat -v Considered Harmful”. After the presentation, Rob Pike and Brian Kernighan published an article based on it. They told us, for example, that the purpose of cat is concatenation and nothing else. Perhaps Rob Pike was the one who invented the “UNIX philosophy”. The cat-v.org website was named after this presentation; read it, it’s a very interesting website.

    But then, many years later, Pike gave two more presentations in which, I think, he abandoned his philosophy. You got that, fans? Your idol gave up his own philosophy. You can go home now. In the first presentation, “Systems Software Research is Irrelevant”, Pike complains that no one writes new operating systems anymore, and even when they do, it’s another UNIX: “New operating systems today tend to be just ways of reimplementing Unix. If they have a novel architecture — and some do — the first thing to build is the Unix emulation layer. How can operating systems research be relevant when the resulting operating systems are all indistinguishable?”

    Pike’s second presentation is titled “The Good, the Bad, and the Ugly: The Unix Legacy”. Pike says that flat text is not universal; it’s good, but it doesn’t always work: “What makes the system good at what it’s good at is also what makes it bad at what it’s bad at. Its strengths are also its weaknesses. A simple example: flat text files. Amazing expressive power, huge convenience, but serious problems in pushing past a prototype level of performance or packaging. Compare the famous spell pipeline with an interactive spell-checker”. Then: “C hasn’t changed much since the 1970s… And — let’s face it — it’s ugly”. Then Pike admits the limitations of pipes connecting simple utilities, as well as the limitations of regexes.

    UNIX was genius at the time of its introduction, especially if we remember what tools the authors of UNIX had. They didn’t have a ready-made UNIX to develop UNIX on. They didn’t have an IDE. At the beginning they even developed in assembly. I guess the only things they had were an assembler and a text editor.

    At a certain point, the people standing at the origins of UNIX, including Ken Thompson, Dennis Ritchie and Rob Pike, began to write a new operating system: Plan 9, taking into account the numerous mistakes of UNIX. However, no one puts Plan 9 on a pedestal. Pike mentions Plan 9 in “Systems Software Research is Irrelevant”, but still encourages us to write new operating systems.

    James Hague, a programming veteran (he’s been programming since the eighties), writes the following: “What I was trying to get across is that if you romanticize Unix, if you view it as a thing of perfection, then you lose your ability to imagine better alternatives and become blind to potentially dramatic shifts in thinking”. The Source. Read that article and the one it refers to: «Free Your Technical Aesthetic from the 1970s». Actually, if you like my article, you’ll like his blog too.

    So, I’m not trying to say that UNIX is a bad system; I’m just drawing your attention to the fact that it has tons of drawbacks, just like other systems do. Nor am I canceling the “UNIX philosophy”; I’m just trying to say that it’s not an absolute. The post is mostly addressed to UNIX and GNU/Linux fans, and its provocative tone was chosen simply to attract your attention.