
This post was originally published on July 8, 2014 as “Seven Things You Should Never Code Yourself.” It was updated and expanded on March 12, 2019.

As programmers, we like to solve problems. We like it when ideas spring from our heads and travel through our fingertips to create magical solutions.

But sometimes, we can be too quick to start cranking out code to solve a problem. We may immediately roll up our sleeves and dive in—never considering whether someone else may have solved a similar problem and published code that has already been written, tested, and debugged.

Sometimes, we need to stop and think before we start typing.

These nine commonly encountered coding problems, for example, are almost always better solved using an existing solution, rather than trying to code your own:

1. Parsing HTML or XML

Judging by how often coders ask about this topic on Stack Overflow, many apparently underestimate the complexity of parsing HTML or XML. Extracting data from arbitrary HTML looks deceptively simple, but it’s really a job that you should leave to libraries.

Say you’re looking to extract a URL from the src attribute of an <img> tag:

<img src="foo.jpg">

The quickest solution would be to use a simple regular expression (regex) with a capture group to match the pattern:

/<img src="(.+?)">/

The string foo.jpg will be in capture group #1 and can be assigned to a variable. But what if the tag has other attributes?

<img id="bar" src="foo.jpg">

Will it handle alternate quotes?

<img src='foo.jpg'>

Or no quotes at all?

<img src=foo.jpg>

What if the tag spans multiple lines and is self-closing?

<img id="bar"
src="foo.jpg"
/>

And will your code know to ignore this commented-out tag?

<!--
<img src="foo.jpg">
-->

It’s a seemingly endless cycle: Find yet another valid case your code doesn’t handle, modify your code, retest it, and try it again. Or you could use a proper library and save yourself a lot of time.
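As a rough illustration of the library route, here is a minimal sketch using the HTML parser in Python's standard library. The class name ImgSrcCollector, the helper extract_img_sources, and the sample file names are made up for this example. It is not a full scraping solution, but it copes with each of the variations shown above without any special-case code.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        # Tags inside comments never reach this method.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


def extract_img_sources(html):
    """Return the src values of all <img> tags in the given HTML text."""
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


# Hypothetical sample covering the cases discussed above.
SAMPLE = """
<img src="a.jpg">
<img id="bar" src="b.jpg">
<img src='c.jpg'>
<img src=d.jpg>
<img id="bar"
  src="e.jpg"
/>
<!--
<img src="hidden.jpg">
-->
"""

print(extract_img_sources(SAMPLE))
```

The parser handles quoting, attribute order, line breaks, and comments, so the only code left to write is the part that says which tag and attribute you care about.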

2. Parsing CSV files

CSV files look simple, but they’re actually fraught with peril. The following set of comma-separated values, for instance, looks trivial to parse, right?

# ID, name, city

1, Queen Elizabeth II, London

Sure it is ... until you have double-quoted values with embedded commas:

2, J. R. Ewing, "Dallas, Texas"

And once you get around those double-quoted values, what happens when you have a string with embedded double quotes that have to be escaped?

3, "Larry \"Bud\" Melman", "New York, New York"

You can get around those, too, until you have to deal with embedded newlines in the middle of a record.

Save yourself the hassle and the risk of errors: If simply splitting each line on commas can't handle your data, leave the parsing to a library.

If it’s bad to read structured data in an unstructured way, it’s even worse to try to modify it in place. Something as seemingly simple as, “I want to change any 5th field in this CSV with the name Bob to Steve” is dangerous because, as noted above, counting commas isn’t good enough.

To be safe, you need to read the data—using a comprehensive library—into an internal structure, modify the data, and then write it back out with the same library. Doing it any other way risks corrupting the data if its structure doesn’t precisely match your expectations.
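The read, modify, and write cycle described above might look something like the following sketch, which uses Python's standard csv module. The dialect settings (a space after each comma, backslash-escaped quotes) are assumptions chosen to match the sample records in this section, and the function names and the replacement name are invented for illustration. Real files may need different settings.

```python
import csv
import io

# Sample records from this section. The reader settings below are
# assumptions that match this particular sample, not a universal default.
SAMPLE = r'''1, Queen Elizabeth II, London
2, J. R. Ewing, "Dallas, Texas"
3, "Larry \"Bud\" Melman", "New York, New York"
'''


def read_rows(text):
    """Parse CSV text into a list of rows (each row is a list of strings)."""
    reader = csv.reader(
        io.StringIO(text),
        skipinitialspace=True,  # ignore the space that follows each comma
        escapechar="\\",        # treat \" as a literal double quote
    )
    return list(reader)


def rename_in_column(text, column, old, new):
    """Replace `old` with `new` in one column, then serialize the rows again."""
    rows = read_rows(text)
    for row in rows:
        if len(row) > column and row[column] == old:
            row[column] = new
    out = io.StringIO()
    # The writer decides which fields need quoting. By default it escapes
    # an embedded double quote by doubling it.
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


rows = read_rows(SAMPLE)
print(rows[1][2])  # the city field, embedded comma intact
print(rows[2][1])  # the name field, embedded quotes intact
print(rename_in_column(SAMPLE, 1, "J. R. Ewing", "Bobby Ewing"))
```

Note that the code never counts commas. It addresses fields by position only after the library has worked out where each field begins and ends.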

3. Extracting data from JSON

JSON has all the same data type hazards as CSV, with the added headache of being able to store multi-level data structures.

A common question on Stack Overflow comes from someone who wants to extract data from a JSON file, often one returned by a web API. For example, "How can I get just the McHenry phone number?"

{"name":"Tacos El Norte","tags":["Tacos","Mexican"],"phone":{"McHenry":"815-759-9227","Libertyville":"847-837-3488","Waukegan":"847-263-9001"}}

In this case, a simple regular expression won’t work reliably because JSON is a nested data structure, not flat text. You can't just search for "McHenry" and take whatever follows because "McHenry" might appear in multiple fields.

Another complicating factor is that you can break JSON up into multiple lines. Just as multi-line tags can cause problems when extracting data from XML and HTML with a regular expression, you may have the same problems with JSON.

If you’re doing anything with JSON from the command line, there's an invaluable tool called jq that parses and reformats JSON for you.

Passing our JSON file to jq instantly formats it for us:

$ jq . tacos.json
{
  "name": "Tacos El Norte",
  "tags": [
    "Tacos",
    "Mexican"
  ],
  "phone": {
    "McHenry": "815-759-9227",
    "Libertyville": "847-837-3488",
    "Waukegan": "847-263-9001"
  }
}

We can then use simple queries to show only the phone-numbers section of the data structure:

$ jq .phone tacos.json
{
  "McHenry": "815-759-9227",
  "Libertyville": "847-837-3488",
  "Waukegan": "847-263-9001"
}

And then just one phone number:

$ jq .phone.McHenry tacos.json
"815-759-9227"

That’s fine for extracting data from the command line, but most of the time you’ll be working in a programming language. Fortunately, JSON is so common that virtually every programming language has at least one library or module that will parse it. In Python, for example, you'd write:

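import json

# Parse the file into native Python dicts and lists,
# closing the file as soon as it has been read
with open('tacos.json') as f:
    restaurant = json.load(f)

print(restaurant['phone']['McHenry'])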

And in PHP you'd write:

$restaurant = json_decode(file_get_contents('tacos.json'));

print $restaurant->phone->McHenry;

With so many tools at your disposal, the code required to extract data from JSON is code that you should never need to write yourself.

4. Email address validation

There are two ways to validate an email address: You can do it with a simple check, or you can validate it against the rules in RFC 2822 (since superseded by RFC 5322).

Let’s say you want a simple check that verifies an email address has non-whitespace characters, an @ sign, and then some more non-whitespace characters. You might use this regex:

/^\S+@\S+$/

This regex isn’t complete, and it lets invalid stuff through, but at least it will confirm the presence of the @ sign in the middle.

You could also validate the email address against the rules in RFC 2822, which defines the standard format for email addresses. These rules are far more complex than you may realize. A simple regex isn’t going to do the job, even if you knew all the rules in the RFC. You need to use a library that interprets and applies the rules correctly.

If you’re not going to validate against RFC 2822 in its entirety, then you could at least validate against a reasonable subset of the rules. That’s a valid design tradeoff in many situations, but don’t fool yourself into thinking that you’ve covered all the cases unless you go back to the full RFC, or use a library written by someone who has.
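To make the tradeoff concrete, here is a small Python sketch of the simple check described above. It is only that simple check, not RFC validation. The function name looks_like_email and the sample addresses are invented for this example.

```python
import re

# The simple check from above: non-whitespace, an @ sign, more non-whitespace.
SIMPLE_EMAIL = re.compile(r"\S+@\S+")


def looks_like_email(address):
    """Return True if the address passes the simple check (not RFC validation)."""
    return SIMPLE_EMAIL.fullmatch(address) is not None


# Hypothetical inputs: the last one is not a valid address, yet it passes.
for candidate in ["user@example.com", "no-at-sign", "two words@example.com", "a@b@c"]:
    print(candidate, looks_like_email(candidate))
```

The last candidate shows the limitation: the check confirms that an @ sign sits between two runs of non-whitespace characters and nothing more. If that is all your application needs, fine. If not, reach for a library that implements the RFC.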

5. Processing URLs

URLs aren’t nearly as odious to deal with as email addresses, but they’re still full of annoying rules that you have to remember: What characters do you need to encode? How do you handle spaces? What do you do with + signs? What characters are valid after a # sign?

For whatever language you’re working in, you can expect to find libraries that can break apart URLs into the components you need and then reassemble them, properly formatted. You’ll also find code that can validate URLs.

Say you had the URL:

https://beta.example.com:8000/r/example

To extract just the hostname, you could probably use a regex, or you could use your language’s standard function. For example, to use PHP’s built-in parse_url function:

$url = 'https://beta.example.com:8000/r/example';

$host = parse_url($url, PHP_URL_HOST);

Nearly every language has URL manipulation functions. Use them.
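For comparison, here is roughly what the same job looks like with Python's standard urllib.parse module. The helper add_query and the query values are made up for illustration. The sketch splits the sample URL into components, then rebuilds it with a properly encoded query string.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit, urlunsplit

url = "https://beta.example.com:8000/r/example"

# Break the URL into its components.
parts = urlsplit(url)
print(parts.scheme, parts.hostname, parts.port, parts.path)


def add_query(url, params):
    """Return the URL with its query string replaced by the encoded params."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical parameters containing characters that need encoding.
search_url = add_query(url, {"q": "fish & chips", "tag": "a+b"})
print(search_url)

# Decoding reverses the encoding, so the original values come back intact.
print(parse_qs(urlsplit(search_url).query))

# Path segments use percent-encoding for spaces rather than "+".
print(quote("my file.txt"))
```

Questions like "what do I do with + signs?" are answered by the library: urlencode and parse_qs agree with each other on the encoding, so you don't have to remember the rules.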

6. Date/time manipulation

At first, it may seem easy to wrap your head around all the rules for date/time manipulation. But things can get complicated quickly; you may have to account for multiple time zones, daylight saving time, leap years, and even leap seconds. For instance, you might have to figure out the date 10 days after the current one, or calculate the number of minutes between two times.

This is hard enough when dealing only with United States time zones. It gets even more complicated when you’re working globally. (Did you know that some time zones differ from adjacent ones by minutes, not by whole hours?)

Even something as simple as validating a date may have corner cases. It's not just a matter of "Thirty days hath September..." to figure out how many days are in a month. Did you know 2000 was a leap year with 29 days in February, but the year 2100 won't be a leap year?

Why bother tracking all of these variables when an existing library has already done it for you? Whether you’re performing date arithmetic to calculate a specific amount of time on the calendar, or you’re validating that an input string is in fact a valid date, use an existing library.
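Here is a short Python sketch of those tasks using only the standard datetime and calendar modules. The function names and the sample dates are invented for this example, and the time zone at the end is a fixed UTC+05:30 offset rather than a named zone with daylight saving rules (for those, Python 3.9 and later ship a zoneinfo module).

```python
import calendar
from datetime import date, datetime, timedelta, timezone


def days_after(start, days):
    """Return the date that falls `days` days after `start`."""
    return start + timedelta(days=days)


def minutes_between(earlier, later):
    """Return the whole number of minutes from `earlier` to `later`."""
    return int((later - earlier).total_seconds() // 60)


def is_valid_date(year, month, day):
    """Return True if the year/month/day combination exists on the calendar."""
    try:
        date(year, month, day)
    except ValueError:
        return False
    return True


# Ten days after February 25 depends on whether the year is a leap year.
print(days_after(date(2019, 2, 25), 10))
print(days_after(date(2020, 2, 25), 10))

print(minutes_between(datetime(2019, 3, 12, 9, 0), datetime(2019, 3, 13, 10, 30)))

# 2000 was a leap year, but 2100 won't be.
print(calendar.isleap(2000), calendar.isleap(2100))
print(is_valid_date(2000, 2, 29), is_valid_date(2100, 2, 29))

# A zone that is offset from UTC by hours and minutes, not whole hours.
utc_plus_0530 = timezone(timedelta(hours=5, minutes=30))
noon_utc = datetime(2019, 3, 12, 12, 0, tzinfo=timezone.utc)
print(noon_utc.astimezone(utc_plus_0530).isoformat())
```

None of these functions contains a rule about month lengths or leap years. That knowledge lives in the library, where it has already been tested.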

7. Templating systems

It’s almost a rite of passage for programmers to create boilerplate text:

Dear #user#,

Thank you for your interest in #product#...

Your first version of a template like this one may work for a while. But then you have to add multiple output formats and numeric formatting, and then you have to output structured data in tables, and on and on—until eventually, you’ve built an ad hoc monster requiring endless care and feeding.

If you’re doing anything more complex than simple string-for-string substitution, step back and find yourself a good templating library. To make things even simpler, if you’re writing in PHP, the language itself is a templating system (though that is often not its main use case these days).
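Even the simple string-for-string case rarely needs hand-written substitution code. As a minimal sketch, Python's standard string.Template class handles it. Note that it uses $name placeholders rather than the #name# style shown above, and that the function name and sample values here are made up for illustration.

```python
from string import Template

# The boilerplate letter from above, rewritten with $-style placeholders.
LETTER = Template("Dear $user,\nThank you for your interest in $product...")


def render_letter(user, product):
    """Fill in the letter; a missing placeholder value raises KeyError."""
    return LETTER.substitute(user=user, product=product)


print(render_letter("Ada", "Widget Pro"))
```

Once you need loops, conditionals, or output escaping, that is the signal to move to a full templating library rather than extending something like this yourself.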

8. Logging frameworks

In many cases, logging tools start small and grow into behemoths. You write one simple function for logging to a file, but then you have to revise it so that it can log to multiple files, or send email notifications on completion, or have varying log levels, and so on.

Fortunately, most languages have several log packages that have been around for years and will save you no end of aggravation. One minor standard that has developed is the log4(languagename) convention. Back in 2001, a Java logging library called log4j was released. It became popular and was adapted to other languages. Now there are libraries such as log4php, log4py, and log4go (for PHP, Python, and Go, respectively). If you’re looking for a logging framework, start by searching for log4(languagename), and check whether your language's standard library already includes one.
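Python, for instance, includes a logging module in its standard library. The sketch below shows log levels and a configurable destination with only a few lines of setup. The helper build_logger, the logger name "orders", and the messages are hypothetical, and the output goes to an in-memory buffer only to keep the example self-contained.

```python
import io
import logging


def build_logger(name, stream, level=logging.INFO):
    """Return a logger that writes formatted messages to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False  # keep messages out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger.handlers.clear()
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger("orders", buffer)

log.debug("not shown at INFO level")  # below the threshold, so dropped
log.info("order %s accepted", 42)
log.warning("inventory low")

print(buffer.getvalue(), end="")
```

Logging to multiple files or sending notifications then becomes a matter of attaching more handlers, not rewriting your logging function.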

9. Security and encryption

In the cases we've looked at so far, I recommended using existing code to save yourself the time and hassle. In the case of security and encryption code, however, there’s a more important reason: You're unlikely to do it right on your own.

According to Schneier’s Law, "Anyone can invent an encryption algorithm they themselves can't break; it's much harder to invent one that no one else can break." Encryption algorithms are tested and attacked by experts who are much better at it than you are. (For more examples of why you should reuse existing code for security and encryption, see "Why shouldn’t we roll our own?" on Security Stack Exchange.)

Whether you’re using database bind variables to avoid SQL injection or salting passwords when creating hashes, it’s critical not to take shortcuts. Follow the recommended practices. Even if you think that your way is just as good, it probably isn't. Trust the professionals: Leave security and encryption to the experts, and do what they say.
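As an illustration of both practices, the following Python sketch uses bind variables with the standard sqlite3 module and a salted, deliberately slow password hash from hashlib. The table layout, function names, and sample credentials are invented for this example, and the iteration count is an illustrative assumption. Check current guidance before choosing parameters for a real system.

```python
import hashlib
import hmac
import secrets
import sqlite3

ITERATIONS = 200_000  # illustrative assumption, not a recommendation


def hash_password(password, salt=None):
    """Return (salt, digest) using PBKDF2-HMAC-SHA256 with a random salt."""
    if salt is None:
        salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password, salt, expected):
    """Re-derive the digest and compare it in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)


def add_user(conn, name, password):
    """Store the user with a salted hash, never the password itself."""
    salt, digest = hash_password(password)
    conn.execute(
        "INSERT INTO users (name, salt, pw_hash) VALUES (?, ?, ?)",
        (name, salt, digest),
    )


def find_user(conn, name):
    """Look up a user by name; the ? placeholder is a bind variable."""
    return conn.execute("SELECT name FROM users WHERE name = ?", (name,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, salt BLOB, pw_hash BLOB)")
add_user(conn, "alice", "correct horse battery staple")

print(find_user(conn, "alice"))
# An injection attempt is treated as an ordinary (non-matching) name.
print(find_user(conn, "x' OR '1'='1"))
```

Everything security-sensitive here (the key derivation, the random salt, the constant-time comparison, the query parameter handling) comes from the library. The sketch only wires the pieces together.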

Isn’t this overkill?

Sometimes, programmers don't want to use existing code. We’re proud of our skills, and we like the process of creating code to solve our problems. That’s fine, but often the best way to solve a problem is to write as little code as possible. In software, the most expensive time is a programmer's time. Including an additional library in your software may add a few milliseconds of program execution time that nobody is likely to notice. What won’t go unnoticed are the hours or days you burn tracking down bugs from code that you wrote—but didn't need to.