LCA 2003: Bugs talk

I gave a talk at LCA 2003 in Perth about bugs. My slides, such as they are, are in magicpoint format. They are not very comprehensive. This is my recollection of what I said about them. It is subject to (extensive) change when I get my hands on the ogg file and find out what I actually said.

Making the right sort of difference: bugs and what to do with them

Why am I the one to give this talk?

Well, yes. They asked me. Of the awful lot, I have been asked how many I have reported. The awful truth is that I don't really know. I really doubt that it is the thousand or more some people think. I wish it were that high, but no. I think in the low hundreds, that's all. The measurable fraction of the old GNOME bug-tracker is something like 0.5%, or one in every two hundred bugs. It's only fair to note that not many people knew about the GNOME bug-tracker at the time. I had an advantage.

What is a bug?

Someone shouted another something out from the audience, but I don't remember what it was.

Where do bugs live?

The point of that somewhat arbitrary list is that I have met and reported them in all of these. The funniest command-line app I ever broke was wget. Somehow. Don't ask. As for the web, I don't remember whether I used this example, but one example I intended to include was that of a friend who has a thirty-character email address. Many web pages won't accept email addresses which are that long, and so she can't use web services which do this.
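For illustration, here is a minimal sketch of a check that would not have locked my friend out. It assumes the commonly cited ceilings of 254 characters for a whole address and 64 for the part before the @, rather than whatever width the form designer happened to pick. The function name, the constants and the sample address are all made up for the example; it is deliberately loose and is not any real site's code.

```python
MAX_ADDRESS_LENGTH = 254     # commonly cited ceiling; an assumption for this sketch
MAX_LOCAL_PART_LENGTH = 64   # likewise an assumption, not a site's real limit


def looks_like_email(address: str) -> bool:
    """Loose sanity check: reject only what is clearly not an address."""
    if len(address) > MAX_ADDRESS_LENGTH:
        return False
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return False
    if len(local) > MAX_LOCAL_PART_LENGTH:
        return False
    # Require a dot in the domain and no empty labels such as "example..org".
    labels = domain.split(".")
    return len(labels) >= 2 and all(labels)


# A made-up thirty-character address, the length that tripped up real forms.
thirty = "a.very.long.name.x@example.org"
```

The point of the design is that length is checked against a generous ceiling, not against the size of a text box.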

The remark about apps you are actively trying to use is because I'm convinced it's true. You can light-heartedly try to mess with applications, but unless you have a fairly complete list of standard things to try, you won't break them. But as soon as you use it seriously, you will find yourself doing things that aren't on your list of things to try, and you will break it.

Finding them (1)

Other than the first line, I think all of these are important. A good example of why not to keep your notes in a text file on the machine in question (or whilst using X and ssh'd into another machine) comes from when I was walking through the tests for GNOME 1.3 programs: if you really do mangle things up, the box will freeze and not even a journalling filesystem will rescue your notes. When I have thoroughly killed X or GNOME, I have once or twice had to ssh in, run top, send the right signal to the editor process and hope it leaves a DEADJOE file for me to retrieve later.

Paper doesn't crash. Coffee stains are a problem, I grant. One thing SuSE used to do (still does?) is to include the man page for fsck in the printed manual. Because if you boot up and meet fsck waiting for you, you can't just open the man page if that's your only computer.

Finding them (2)

This was a list of things to try if you want to find bugs. They work in text or GUI mode. Feeding inappropriate numbers or strings in is an old game. Something always to try is the semi-colon. Just in -case- you can get it to think that marks the end of a command and then it tries to run the next characters. Shouldn't be able to, but you never know :)
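Seen from the developer's side, the semi-colon trick only works if the input ever reaches a shell. The sketch below, in Python, shows one way of making sure it does not: the arguments are passed as a list, so nothing parses the string for command separators. The hostile input and the one-line child program are invented for the example.

```python
import subprocess
import sys

# Hypothetical hostile input: if a shell ever parses this, the part after
# the semi-colon runs as a second command.
user_input = "report.txt; echo INJECTED"

# The child is just a Python one-liner that echoes back its first argument.
child = [sys.executable, "-c", "import sys; print(sys.argv[1])", user_input]

# Passing a list (and leaving shell=False, the default) means no shell is
# involved, so the semi-colon is delivered as an ordinary character.
result = subprocess.run(child, capture_output=True, text=True, check=True)
echoed = result.stdout.strip()
print(repr(echoed))
```

If the whole string comes back unchanged, the semi-colon was treated as data. If a tester sees only INJECTED on a line of its own, there is a bug to report.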

Changing your locale is a great one. The number of things which behave differently is huge. I once spent ages wondering why my tr script to imitate rot-13 had broken after an upgrade. That was the upgrade where RH stopped using a default locale of C and started using one of en_something. This wasn't actually a bug, but it took me a while to track down.
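For illustration, here is a sketch of a rot-13 that the locale cannot affect, on the assumption that locale-dependent character ranges were the culprit in a case like this. It is in Python rather than tr, and the names are made up for the example.

```python
import string

# Build the table from explicit ASCII letters, so no locale's idea of what
# lies "between a and z" can change the result.
_LOWER = string.ascii_lowercase
_UPPER = string.ascii_uppercase
ROT13 = str.maketrans(
    _LOWER + _UPPER,
    _LOWER[13:] + _LOWER[:13] + _UPPER[13:] + _UPPER[:13],
)


def rot13(text: str) -> str:
    """Rotate ASCII letters by 13 places; leave everything else alone."""
    return text.translate(ROT13)
```

Applying it twice gives back the original text, which makes a handy self-test after any upgrade.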

Change the system time, too. (If you're the only one on the system.) Be aware that setting the clock to an earlier time can have drastic effects when you're in X, so make sure you have everything saved before you try this.
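One reason a clock change causes so much damage is that programs work out intervals by subtracting two wall-clock readings. As a sketch of the defensive alternative (the helper name is made up), Python offers a monotonic clock which is documented never to go backwards:

```python
import time


def timed(action):
    """Run action() and return (result, elapsed seconds).

    time.monotonic() never goes backwards, so the elapsed time stays
    sensible even if someone resets the system clock mid-run.
    """
    start = time.monotonic()
    result = action()
    return result, time.monotonic() - start


value, elapsed = timed(lambda: sum(range(1000)))
```

An application that times things this way has one fewer reason to misbehave when you set the clock back underneath it.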

The change between summer time and winter time has great scope for chaos, and happens for two hours or so out of a year. There was a lovely bug in gnome-pim once which only showed up when the clocks went forward (or was it back?). We grabbed the author on IRC and he absolutely groaned. We thought we'd fixed this after the clocks went forward last year. Because the US and Europe go forward and back at varying dates, there was then a frantic rush to get a fix out before the US people got it too. As I recall, we still got a pile of duplicates anyway.

Finding them (3)

When I mentioned the first one (install and run with a quota of 5kb to simulate no disc space), there was a groan from some people in the audience. I gather they must be the developers who have to cope with the fallout if people do this. What you typically find is that configuration files don't get saved when you quit the app, no matter how many changes you have made and hope to save, or -- even better -- only half the file gets written back.
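The half-written config file is avoidable. The usual technique, sketched below in Python under the assumption of a POSIX-style filesystem, is to write the new contents to a temporary file beside the real one and only rename it into place once the write has succeeded. The function name and file prefix are invented for the example.

```python
import os
import tempfile


def save_config(path: str, text: str) -> None:
    """Write text to path so readers see the old file or the new, never half."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".config-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # a full disc or exceeded quota shows up here
        os.replace(tmp_path, path)  # atomic on POSIX
    except BaseException:
        # Writing failed (for example, quota exceeded): keep the old file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

With this approach, running out of space costs you the latest changes but leaves the previous configuration intact, and the caller gets an exception it can show to the user.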

I also managed to work in something that Malcolm Tredinnick had mentioned (only the day before) meeting when things ran out of space. Apparently a co-worker was once hunting down a problem and looked at the log files, to find, in /var/log/messages, the glorious message
Warning! /var running out of spac
(If you haven't met this, yes, it really should be space at the end...)

If you want to mess with "Does it work on a dodgy network?", a laptop is unquestionably of use here. Instead of messing with commands to bring the network interface up and down, you can just pull the PCMCIA card out at random intervals. When Ximian were testing Red Carpet, I had great fun with this. They had to make and send me a debugging build of the thing.

Documentation is definitely fun for bug-hunting. When you are documenting (if you are doing it responsibly), you can't say "Pressing this button does this" unless you've checked it does. So you end up pressing every single button and trying every combination in the process. And you will certainly find bugs. An early version of gcalc once received a documentation submission which included the line "Pressing this button current crashes the application. I do not know what it should do." Once the developer saw that, this particular bug was fixed really quickly.

Once you've written your docs, give them to someone else and tell them to follow them precisely. Or do that to other docs. If the /usr/share/doc/package*/README says to make the config file in /usr/local/etc even though you're on a package-managed system, do that. See whether it gets used. If the filename is all upper-case, or has a typo or a full stop, create it like that. See what breaks.

Breaking graphical apps

These were things to try on GUI applications. Resizing is always interesting. Alan once found that he could bring the slider in Pan past the toolbar and outside the window. He rendered the entire app unusable with no way to get it back. Even better, Pan saved itself when you quit. So you had to kill all your preferences by nuking that file, and there are a lot of preferences you can set in a newsreader.

If you want to break something like a graphical display manager, you want to look at xnest, which lets you start a second X server on the display of your first one. You can break things in that one and still retain your working setup in comfort.

And when in doubt, in any druid or wizard or sequential set of screens, there is always pressing the back button repeatedly. This kills all manner of things. Mozilla used to hate it!

So you have a bug. Woo! Now what?

The best sort of bug is the sort you can make happen again and again with exactly the same steps. Developers will respond to those in preference to bugs that happened once-only, in my experience.

You can get a lot of interesting output if you run a command from the command-line rather than a menu.

And I only mentioned gdb for a laugh. If you must use gdb, you're definitely best to use one of the tools which will run it automatically for you.

Cool tools (1)

I like package management because it makes it so much easier to find out where a bug started or stopped without installing half a dozen varieties from tarballs and then trying to get rid of spare files and links: such links can drastically affect how (and whether) an app behaves.

Most of the programs I mention are really the sorts of thing you want to wait for the developer to ask you to run. And if they ask for strace output, beware! It's gigantic. Look up the options to prune it, or redirect it into a file. (If you do this, better not still have quotas on: you can easily generate megs with strace.)

script is really useful and captures everything which goes to a terminal (except passwords). But if you try to edit the file, don't get a shock. Everything means everything. It even puts a ^M in every time you hit return.
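If you want a readable copy of a script capture, a few lines of Python will tidy it. This is only a sketch: the function name is made up, and the pattern covers the common colour and cursor escape sequences rather than everything a terminal can emit.

```python
import re

# Matches common ANSI escape sequences (colours, cursor movement).
_ANSI = re.compile(r"\x1b\[[0-9;?]*[A-Za-z]")


def clean_typescript(raw: str) -> str:
    """Make a script(1) capture readable: drop escape codes and the ^M
    (carriage return) recorded at the end of each line."""
    text = _ANSI.sub("", raw)
    return text.replace("\r\n", "\n").replace("\r", "")
```

Keep the untouched original as well. If a developer asks for the capture, the raw file is the evidence; the cleaned copy is just for your own reading.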

I prefer to start X from a text login because you can redirect all the errors and output at the console into a file. With a GUI login, you lose all that.

Cool tools (2)

This was meant to be the non-commands. The externals to your computer. Pen and paper I already mentioned. A tape recorder is great when you don't want to copy down all those sequences of four digits you get in a really good X crash, or a kernel OOPS. Especially if you are copying them down and the power-saving kicks in and the screen goes blank (which happened to me more than once).

Apparently kernel hackers are now entirely used to receiving digital photos of OOPSes. Only ever send the URL though, not the entire picture.

Having tried to stick to non-computing things in this section, I think I went off into an explanation of why a network was important for telling the difference between X and kernel crashes and how to do so. Essentially, if you can ssh to the machine, something is still up, and so it's an X crash or hang. By logging in and running top you can get a list of what's chewing CPU or memory (hit M in top for that). I have always been told that you can shoot off X clients one at a time and this may unwedge the X server. It has rarely worked for me. The only exception was Netscape, which reliably could hang X, and which killing could (sometimes) fix things.
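The "can I still reach it over the network?" test can be done from a second machine without even logging in. Below is a rough Python sketch, assuming the wedged box normally runs an ssh server on port 22; the function name and defaults are made up for the example.

```python
import socket


def port_answers(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if something accepts a TCP connection on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A True here suggests the kernel and network stack are still alive, which points towards X or a client rather than a kernel crash. A False is weaker evidence, since the cable, the firewall or the ssh server itself could equally be to blame.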

If you have a kernel crash, there are techniques which are supposed to help. One of them is to hook up a serial console. I did this for one recurring crash. The serial console made not the slightest difference for me. There are also said to be facilities in the kernel to make the keyboard lights flash in patterns that tell you what the problem is. These have never worked for me either. (Although someone after the talk did tell me they worked for him.)

If it's not documented, it is a bug

This was simply a list of all the places you may find information on whether it's a known bug or not. Yes, there are loads of places you can look. I forgot to include Google, too.

Gathering information

When it comes to the version of the application, please remember that "Current CVS" or "Latest release" are not version numbers. There is no saying when your report will be read; and no saying that either the machine you send from or the machine that records the report has an accurate clock.
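As a sketch of what a reporting helper could do about this, the Python below refuses the vague phrases and puts an exact version next to some basic system details. The function, the list of rejected phrases and the application name "frobnicate" are all invented for the example; real helper tools will differ.

```python
import platform

# Phrases that describe a moving target rather than a version.
VAGUE = {"current cvs", "latest release", "latest", "cvs", "head"}


def report_header(app_name: str, app_version: str) -> str:
    """Build the first lines of a bug report from exact facts."""
    if app_version.strip().lower() in VAGUE:
        raise ValueError(f"{app_version!r} is not a version number")
    lines = [
        f"Application: {app_name} {app_version}",
        f"System: {platform.system()} {platform.release()} ({platform.machine()})",
    ]
    return "\n".join(lines)


# "frobnicate" and its version are made-up values for the example.
header = report_header("frobnicate", "1.2.3")
```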

One thing in the list of "Does it happen on..." which is important is a test user. A plain default test user with as few alterations to defaults as possible. This really helps. Otherwise, you may end up having to go through half your config files to find the one critical change: or just sending the entire dotfile in the bug report (and the developer may hate you).
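When creating a real test user is not convenient, a rough approximation is to run the program with HOME pointing at an empty directory. The Python sketch below does that; the helper name is made up, and it only fools programs which locate their dotfiles through the HOME environment variable.

```python
import os
import subprocess
import sys
import tempfile


def run_with_fresh_home(argv):
    """Run argv with HOME pointing at an empty temporary directory.

    Programs that find their dotfiles through $HOME will start from
    defaults; ones that look the home directory up some other way will not
    be fooled, so this is a quick check rather than a real test account.
    """
    with tempfile.TemporaryDirectory() as fresh_home:
        env = dict(os.environ, HOME=fresh_home)
        return subprocess.run(argv, env=env, capture_output=True, text=True)


# Demonstration: the child process reports the HOME it was given.
demo = run_with_fresh_home(
    [sys.executable, "-c", "import os; print(os.environ['HOME'])"]
)
```

If the bug vanishes under these conditions, the cause is probably somewhere in your own configuration, and you know where to start looking.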

People on IRC: they really are responsive. I once found a very large picture on the web. I pasted the URL to IRC, and people started timing out and then coming back and complaining the picture had killed their box. More and more people just had to try. It turned out that they were using a variety of kernels and a variety of image viewers. Some image viewers were trying to load the entire picture into memory (and hitting VM problems), whereas the Gimp, which I used, knew it only needed to display the part of the picture I was actually looking at. Tiling or something.

Sending information.

There's a huge variety of bug-reporting helper apps. I forgot the name of the Debian tool: it's actually reportbug and it sends stuff to the Debian tracker.

One reason to trim down things and never to send core files: you don't always know what's in them. Mutt actually mentions this in its FAQ. It is theoretically possible to get a core file with your GPG passphrase in it. I have heard an argument that if they are that bothered, then mutt should set something which prevents core dumps from being generated (this was Alan, actually: I didn't know you could do this), but until they do, that's a lovely example of why not to send them. Quite apart from the fact that they're gigantic and useless.
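For the curious, the mechanism Alan mentioned is a per-process resource limit. The Python sketch below shows the general idea on a Unix-like system; it is not what mutt or any other program actually does, and a system-wide crash handler may be configured to behave differently.

```python
import resource

# Set both the soft and hard core-file size limits to zero for this
# process (and any children it starts), asking the kernel not to write
# a core file on a crash.
resource.setrlimit(resource.RLIMIT_CORE, (0, 0))

soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_CORE)
```

Because the hard limit is lowered too, an unprivileged process cannot raise it again later.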

Images cause more trouble than you'd think. There is the "I am on a 28.8 modem and you want me to look at a 1Mb screenshot?" one; there is the point that a picture of your entire 1600x1200 desktop is generally not necessary if you are trying to report something that's happening in one window of that; and then you really would be amazed at what annoys people.

I know people who have policies at their workplace of "If you are found looking at a dodgy picture at work, this is grounds for instant dismissal." It is just not fair to open other people up to that just because you like your picture of ... well, whatever. Yes, Mozilla does get bugs about "Can't render this adult site picture", but label them clearly. And finally, there are people who find pictures of the sort you can find at stiles.com simply offensive. If they see them unexpectedly, they may decide to skip your bug and deal with a different one.

Gnome has even had complaints about the song titles visible in the playlist of xmms on the screenshots on gnome.org.

The trackers

The only comment I remember making on this slide was that Sourceforge's bug-tracker doesn't work with Lynx.

Stop right there!

For the list of binary apps and modules, I just went through them, and pointed out that Netscape has its own bugtracker which is not Mozilla's. I think I may have mentioned that there are instructions for how to get rid of the binary modules to see whether you can regenerate the bug again, but I didn't go into them.

Some projects are friendlier than others.

This was ill-titled. It was really about picking where to send a bug. People in the audience got into a digression about mplayer, an app I know little about (frankly, the idea of building from CVS scares me). There are projects I am scared to report bugs to, but I didn't name them that I can remember.

It was either here or further down that I read out some responses I had had to one bug-report which were not entirely supportive.

As to why it's important always to get it reported somewhere, I have a story about that, too. It is from way back in the run-up to Gnome 1.4. This was our first "We are grown-up and will have a formal release cycle" release. We had loads of betas. We had a release team rather than a release person or two co-release managers. We had bugzilla. We had lists of things to test for every app. And the release cycle went on for an age.

And all through it, I kept seeing the occasional mention of something eating all the file descriptors up. Without checking the manual, I could never remember what these things were, but I knew that losing them all was bad. But no-one ever filed a bug, so I always assumed it was not Gnome, it was fixed, or whatever.

And the day for the release approached. And the night before it, I was relaxing on IRC. And someone said they were sure that medusa (an indexing thing) had just eaten all their file descriptors. Someone else observed something along the lines of Yeah, it does that. I blinked, and checked bugzilla. There was nothing about this at all, but suddenly half the channel were telling tales of how they'd had it eat their file descriptors.

And not one of them had filed it in bugzilla, and it was now less than twenty-four hours to release. And I gibbered, and then sent an email to the release-team which began, as I recall, something like "You are going to hate me" or "Please don't kill me but..."

And left them to deal with it. In the event, they pulled the package, and the rest of the release went ahead. But had it not been for one person grumbling, we'd have shipped Gnome 1.4 like that and a few weeks down the line we'd have started getting an awful lot of complaints.

Because no-one put it in bugzilla. They all thought someone else would do it.

So that's why it's important to report it somewhere.

I found a HUGE security hole!

Nothing really to add here. I was fretting about time and whizzed on.

Terminology is hard

The first set of terms were all terms I have seen used to refer to the Gnome panel in bug reports or on IRC. Jeff Waugh came up with some more subsequently.

"root window" is interesting because to old guard types it means the window that all the other windows are drawn on in X. What people often call the background, or the wallpaper, or the desktop. However, to some people it means the terminal they have a root prompt in. It probably shouldn't mean that, but...

Some ways to get your report ignored

The person reading the bug report is possibly not the person who wrote the offending application. So even if you think you're justified swearing at them, you're likely just to irritate or upset the wrong person. If you do get the right person, it's still not going to help. I can't find the reference off-hand, but I clearly remember a KDE developer saying he was no longer interested in maintaining an application because of the pushy and rude feedback he was getting. (If he followed established Gnome precedent, he probably came back a week later, of course.)

If you say you have removed or are about to remove the package, there is absolutely no incentive for the maintainer to fix it. You can't test the fix, you presumably won't use the fixed version, and what's the point?

Developers! Cut down on those pesky bug reports!

Here are some responses which I got to a bug report I once made to a distribution about a package. I deliberately told the distribution rather than the upstream authors because, well, it's free software and if the package maintainers wanted their app to do certain things by default, that was up to them. But I thought the distro in question might want to alter defaults and so on. I will grant it turned into one of my longest laundry-list (always a mistake) bug reports. But I still don't believe it justified these.

Just set your damn terminal up right [...] I use that and it works fine for me. But then, I'm not lazy enough when it doesn't work to blame the programmers who wrote it, instead of myself. It's a user configuration problem.
$distro is the worst OS I have seen. Don't bother shipping our package with it.
Dumbass. Go back to windows.

As for the quote from the GNU Maintainers guidelines, it's true, and you can find it in the maintaining GNU software document. It's just before halfway down, and there is a lot about bug reports there. But yes, it says "So always thank each person who sends a bug report" there. (And it does recognise that not all reports are useful.)

Filed. Now what?

Yes, these look extremely gloomy statistics. I glossed over this because I was out of time. But it's important to remember that I file a great many bugs, and some of them are on the level of missing or misplaced apostrophes in man pages or gratuitous commas. (Although actually, docs bugs with a clear suggestion of what they should say are some of the fastest to get fixed.)

By this time, you have generally upgraded.

Nothing to add here.

Questions

Enrico asked whether this would be on the web. I said yes. There you are, Enrico! The proper bunch of pages all sanely linked to each other and with more information are still on the way.

He also asked whether I wanted translations for bug-pages-when-done. Yes, please!

Someone asked how on earth I core-dumped wget. I didn't actually recall. I thought at the time I was asked that it was to do with getting a valid HTML page and so few pages were valid that the bug had never been spotted. I have now actually bothered to dig out my email about it (which was never answered, like so many...) and find that in fact I remembered it backwards: the page which was crashing wget was generating invalid HTTP stuff with an extra carriage return.
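To show the sort of input involved, here is a small Python sketch of a header parser which shrugs off a stray carriage return instead of falling over. It has nothing to do with wget's actual code, and the sample response is invented: I have placed the extra carriage return after the first header purely for the example.

```python
def parse_header_block(raw: bytes) -> dict:
    """Parse 'Name: value' header lines, tolerating stray carriage returns.

    Malformed lines are skipped rather than trusted, so odd input yields
    fewer headers instead of a crash.
    """
    headers = {}
    for line in raw.split(b"\n"):
        line = line.strip(b"\r \t")
        if not line:
            continue
        name, sep, value = line.partition(b":")
        if not sep or not name.strip():
            continue  # not a header line; ignore it
        key = name.strip().decode("latin-1").lower()
        headers[key] = value.strip().decode("latin-1")
    return headers


# A response header block with an extra carriage return after one line.
sample = b"Content-Type: text/html\r\r\nContent-Length: 42\r\n\r\n"
parsed = parse_header_block(sample)
```

Feeding deliberately malformed blocks like this one to anything that speaks HTTP is another cheap entry for the list of things to try.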

There was a question subsequent to the talk about whether I was serious with my numbers at the end. I said yes, absolutely. He looked rather taken aback. On reflection, I should qualify that. I have not actually counted up all my bugs and their eventual resolution. It's just an impression. Also, I am a sod for not replying to questions and bugs set NEEDINFO. (This is another great way to get your bug ignored.)

That was it. There is a great deal more I could have said. I didn't get a chance to mention some of my favourite bugs: the Mozilla one in which the page source and the rendered page differed if you were on a slow network connection; the Gnome one in which the translators started campaigning for better internationalisation of wind speeds (this is true); the Mozilla kitchen sink; or the time I reported a typo in a preference dialogue box and the developer fixed not only the user-visible parts but also the spelling of the verb all through the code where it was used as a function name.

I also didn't get a chance to thank the people who have put up with my bug reports, the people who have put up with my questions about how to report them, the mutt development team, who responded to my very first bug report (with patch!) by fixing typos in a config file within a day, and Michael Zucchi, who responded with a chatty email to my second bug about problems with gnome-terminal. Had it not been for the mutt people and Michael responding, I'd have come to the conclusion it was all a waste of time and very hit and miss. (The former is not true; but the latter still is, I think.) But they responded, and thereby gave me the impression that it was worth reporting bugs. So thank you to them.

But whilst 45 minutes seemed a horrendous length of time when I started, it's not that long really.