The most common phone number

Or, how to completely fuck up using 21st Century technology.

I explained this screw-up for a somewhat nontechnical audience on another website, so I’ll just copy-and-paste that here:

Yes, it really is happening. (The database software appears to be MySQL when it isn’t in strict mode, from what I gather.) I googled it and found places claiming locations in Edenton, NC, Seattle, WA, Montgomery, AL, and Leavenworth, KS all have this same number, with its Dallas-Fort Worth area code. It’s just… breathtakingly stupid. It’s the end result of a whole career’s worth of utter incomprehension coupled with an absolute refusal to run even the most rudimentary testing. This deserves to be in textbooks, someone (or some group) needs to be fired, and a lot of people ought to be apologized to in person.

When you do the reverse phone look-up you get a bunch of different home care places and rental car places around the country…and Will Stuart in Dallas, TX.

Sucks to be you Will.

This I don’t understand. If I wanted my computer to understand a number, for example 3,000,000,000, my computer would think I was typing 2,147,483,647? Aren’t there millions and millions of phone numbers that start with an area code larger than 214? I keep re-reading the OP, but I’m clearly missing something.

Somebody reads /r/programming then :stuck_out_tongue:

Not necessarily. Certainly not always. What caused this is that the database user told the database “This number can be stored in a fixed amount of space (32 bits, with one bit reserved to tell whether it’s positive or negative) and don’t bother me if it can’t be.” The first part is done for speed; the second part is done because the designer was Just. That. Stupid.

A lot of computer programs have parts which can’t handle numbers bigger than a certain threshold value; this isn’t actually a problem assuming the people making the software actually test it and ensure that limit isn’t relevant to what the software does.

Yes, and this is the really stupid part, you see: Intelligent people don’t store phone numbers as numbers, they store them as text*, as if they were sentences like “A moron designed some software once.” or “Shoot the idiot out the damned airlock.” There is no reason to store a phone number as a numeric value in a computer. You only do that if you are an idiot.

*(Or, possibly, binary-coded decimal, which acts more like text than a single integer value does.)

The system storing large numbers as 2,147,483,647 is one which uses 32 bits (binary digits) to store integers, so it can only store numbers between -2,147,483,648 and +2,147,483,647. If you need to work integers larger (or smaller) than that, then you need to store them in some other way. There is no way that you can store 10-digit telephone numbers as 32-bit data.

I do when Reddit is actually working, but I also read Hacker News.

While it certainly takes an idiot to store phone numbers as 32-bit numbers, I really don’t see the problem with storing them as 64-bit. 64 bits is 8 bytes, and that’s a more compact representation than 10 or more characters of text.

After doing a little research, it appears that the standard for “longest possible phone number”, including country code, is 15 characters. Even a signed 64-bit number provides more than enough room.

Sure, if you have to sanitize the number or some such, it’s easier to do it in text form, but you should only have to do that once when storing to the database. Pretty-printing on retrieval is trivial, and the fact that it’s numeric instead of text actually means you have less to worry about–no chance of rogue spaces, dashes, parens, etc. showing up.

So I don’t see the issue (besides the obvious, and humorous, 32 vs. 64-bit issue).

As someone who has programmed phone-related software for years, I have to say that any programmer who decided to use an integer to store what should only be done in a string is a very bad programmer. Aren’t the virtues of strings vs. numbers taught in Computer Science 101 anymore?

BCD (Binary-Coded Decimal) is the best choice for phone numbers. It is not quite as compact as other schemes, but it is easily manipulated and can be read by humans directly from a hex/byte display. BCD stores two digits in the space of one byte (one nybble per digit, which uses 0…9 and wastes A…F(Hex), or 37% of available storage, so it’s twice as efficient as ASCII or any other scheme using one byte per digit. It is also easier to extract a single digit or digits, like area codes, without having to use division or other mathematics on an integer, long integer, or (Og forbid), floating point.

OK, write me a fast query that tells me how many people we’re serving in a given region served by this area code and prefix. Note how much work has gone into making textual searching in databases fast, compared to how much has gone into making it easy to describe complex arithmetic in SQL.

“SELECT * FROM db WHERE (phone DIV 10000) = 1555555”

Well, my SQL is pretty rusty, but I’m fairly sure that works (I’m assuming the US here with a country code of 1). The arithmetic isn’t very complex.

That said, if I needed to do queries on area codes and prefixes, I would probably keep them as separate fields. Each of country code/area code/prefix/line number will fit into a 16-bit int. Total byte count is the same. Queries are then even easier/more readable/more reliable than with text.

Better make sure none of the country codes have leading 0s first.

And what would you do with a phone number that was “003”? Store it as 3? Then how would you transmit it? Sending a 3 won’t work if “003” is expected.

Folks, you don’t use numerical storage if your task requires string manipulation unless you are just a glutton for punishment. Learn it and deal with it. It’s Computer 101, fer chrissakes.

So are you locking out all non-NANP numbers? Because my cousin from Australia would be pissed if your website didn’t accept his 1-character area code (2 with the trunk dialling digit 0) and eight-digit local number.

And then there are the sites that barf when I enter my Canadian postal code. Or refuse to accept it with a space in the middle. People, if you’re going to have fussy exact formatting requirements, tell us first, before we enter the data!

According to the lists I’ve seen, none do. But even then, it would only be a problem if two country codes mapped to the same value after stripping off leading zeroes–“90” vs. “090”, say.

If which number was 003? Area codes and prefixes can’t start with 0. Line numbers can start with 0, but they’re never transmitted by themselves.

If the local number does not have a natural prefix/line number split like American numbers, then one could simply split it artificially (stick the first 4 digits in the prefix, and the next 4 in the line number field.

Ok, so you complain that this is complicated and arbitrary. True. But breaking up the number was motivated by searching the DB for US-specific fields. There are really only three options as I see it:

  1. You don’t care about detailed searches (everything for X area code, etc.); we’re just storing the number. A plain 64-bit int is enough no matter what country you’re in.
  2. You do need detailed searches, but staying within the US (or countries with similar rules) is enough. 3 or 4 16-bit fields is enough.
  3. You need detailed searches, but need to handle every country on the planet. Here you’re forced anyway into implementing every nitpicky difference between countries, and dealing with the fact that prefixes don’t exist everywhere and so on. One could use solution 1 with math ops, or solution 2 with some mapping algorithm, or just use text. None of the solutions avoid the problem of having to implement rules for every individual country.

I will grant that if you want to store something beyond a phone number–extensions, or DTMF command sequences, say–then using an integer gets unwieldy or impossible. So yeah, definitely use a string in that case.

That’s for sure. But it’s part of why I lean towards numeric storage, all else being equal: it forces the programmer to sanitize the inputs right there. With a string, the DB is liable to end up filled with “(555) 555-1212” vs. “555-555-1212” vs. “555.555.1212”, and so on. Whatever is reading from the DB now has to deal with that, and by that time it’s too late to tell the user that they screwed up.

But part of the programmer’s job is to take on complexity in order to make make things simpler for the user. The programmer only has to deal with the complexity once, when writing the program. If you leave the complexity out there for the users, they have to deal with it over and over and over again.

The phone number input field should be able to accept and parse all input that can reasonably contain a phone number. Maybe you can have an orerarching country selection that chooses how you will parse later things like phone number and address, and for phone numbers maybe you can exclude alpha characters, but you should be able to handle /, -, (, ), period, and space separating digits. I’m surprised there isn’t a canned chunk of code to do this before passing the sanitized data on to the DB.

Yes, absolutely. The webform or whatever should be as accepting as possible for unambiguous input. However, the programmer can’t anticipate every possible combination of inputs. When this happens, the error should propagate back to the user immediately.

Of course, regardless of how the data is stored, input needs to be sanitized first (for security reasons if nothing else!). But a string field in the DB means that it’s possible for unsanitized or undersanitized data to make its way in. Maybe some rookie programmer misunderstood the spec and left dashes in the number, or some such, and then future substring searches don’t work because they use the wrong character offsets. Storing as an integer makes it harder for that kind of problem to appear.

Anyway, I’m by no means dead-set against phone-numbers-as-text; I just dispute the notion that only an idiot would ever use ints. Both approaches have advantages and disadvantages. And as was said, BCD might be better yet in some cases.