I was in Helsinki last week for the UNU-WIDER workshop on the Vietnam Access to Resources Household Survey (VARHS) data, presenting work that my coauthors and I have been doing using these data.
One thing that I saw a few instances of during the workshop was the following. A researcher wants to include a variable x in a regression, but that variable needs to be logged. Because there are many zero-valued observations of x, and because log(0) is undefined, the author simply uses log(x + 1), or log(x + 0.001), or log(x + 0.00001), and so on.
This post is about what to do in such cases. There are many instances in development where you'd like to include a financial variable, say the value of chemical fertilizer used on a given plot, for which many observations are zero. In the fertilizer example, not everyone in the data uses chemical rather than organic fertilizer, so those respondents report a zero when you ask them the value of the chemical fertilizer used on any of their plots.
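To make the workaround concrete, here is a minimal Python sketch of what log(x + c) does to the zeros for the three constants mentioned above. The fertilizer values and the helper name `shifted_log` are made up for illustration; they are not from the VARHS data.

```python
import math

# Hypothetical fertilizer values; the zeros stand for households
# that report using no chemical fertilizer.
x = [0.0, 0.0, 5.0, 120.0]

# The constants mentioned above as common choices.
constants = [1, 0.001, 0.00001]

def shifted_log(values, c):
    """Return log(v + c) for each v, the ad hoc fix for log(0)."""
    return [math.log(v + c) for v in values]

for c in constants:
    print(c, [round(v, 2) for v in shifted_log(x, c)])
```

Each zero is mapped to log(c), so it becomes 0 when c = 1 but roughly -6.9 when c = 0.001 and roughly -11.5 when c = 0.00001, while the positive values barely move. In other words, the distance between the zero and nonzero observations depends on an arbitrary choice of constant.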
When you want to log a variable x but that x has many zero-valued observations, there are three things you can do in principle: