Now we'll work on removing parts of the data that give inordinate trouble. For instance, when we look at all the daily (price) returns,

$rets=. ret1p&.>3{"1&.>CODAT
848
$;rets
1881993

we see that we have almost 1.9 million observations. However, we run into trouble as soon as we attempt to look at the return distribution:

load'mystats'
ss=: 'Returns Distribution' plotHistoMulti ;rets
|domain error: to
| x.+(1{y.)*(*x.-~0{y.)* i.>:<.0.5+(|x.-0{y.)%1{y.
usus ;rets
__ 0.87735849 __ _

This last set of numbers tells us our lowest return is negative infinity, our highest is 87.7%, the mean return is negative infinity and the standard deviation is infinity. The infinities are due to zeros in the price data, so let's see what the stats are if we remove these (negative) infinities from the returns:

These numbers are more reasonable but the low extreme is still doubtful: the lowest return is about -10,000%, and the mean is about -21%, which is possible but seems somewhat unlikely.
Looking at the ten lowest and highest returns tells us that there are some outliers:

We see about 4 low returns in the negative thousands of percent with a quick increase to a flatter (few hundreds) set of negative returns. On the positive side, the change from the extreme is more gradual so these outliers are less different from the body of the observations.
Some more work lets us know that a few of our data series are very short; the shortest are only one or two observations:

pxs=. _1-.~&.>3{"1&.>CODAT
<./#&>pxs
1
I. 1=#&>pxs
781 847

Companies at indexes 781 and 847 have only 2 and 1 observations:

(#&>CODAT){~10{./:#&>CODAT NB. 10 smallest data series
1 2 3 24 27 28 44 46 66 87

If we have to pick a cut-off below which we will discard a series, it looks like there's a natural break around 40. This would eliminate these companies:

I. 40>#&>pxs
220 781 844 845 846 847
COINFO{~I. 40>#&>pxs
+----+------------------------------------+--------+------+----------+
|HLT |Hilton Hotels Corp. |43284810|005643| 1/02/1990|
+----+------------------------------------+--------+------+----------+
|MNST|Monster Worldwide Inc. |61174210|064156| 6/04/2001|
+----+------------------------------------+--------+------+----------+
|COH |Coach Inc. |18975410|140541| 9/01/2004|
+----+------------------------------------+--------+------+----------+
|CIT |CIT Group Inc. |12558110|149738|10/27/2004|
+----+------------------------------------+--------+------+----------+
|LH |Laboratory Corp. of America Holdings|50540R40|014960|11/01/2004|
+----+------------------------------------+--------+------+----------+
|LLL |L-3 Communications Holdings Inc. |50242410|110685|12/01/2004|
+----+------------------------------------+--------+------+----------+

The latter four look like they joined the S&P near the end of our date range. The other two must have dropped out soon after the start of our date range (Hilton Hotels) or soon after being added to the index (Monster Worldwide).
So, let's remove these few companies based on how few observations they comprise. We have to adjust three of our global variables:

exclsz=. 40<#&>CODAT NB. Eliminate short series
'COIX CODAT'=: (<exclsz)#&.>COIX;<CODAT
COINFO=: (1,exclsz)#COINFO

Now, looking the returns with these removed, we see there are still some suspicious returns of -100%, so let's remove these as well and look at the statistics on the sum of our observations:

We still have 1.48 million observations so none of these eliminations have cost us too many observations.

rets=. _1-.~&.>ret1p&.>_1-.~&.>3{"1&.>CODAT
usus ;rets
_0.85158151 0.87735849 0.00046699481 0.024474664

(compare this (min, max, mean, std dev) to final winsorized stats).
Now the lowest and highest returns are of similar magnitude and the mean is slightly positive.

However, our graph of the distribution of returns is still dominated by the extremes:

ss2=. 'Returns Distribution After Initial Eliminations' plotHistoMulti ;rets

These don't look unusual but they make it hard to see the distribution of the main body of returns. To get a better look at this bulk of the observations, we'll use a technique called "winsorizing". This reins in the extreme values by capping them at some arbitrary value closer to the median of the observations.

So, let's winsorize the observations at both ends to concentrate on the bulk of the distributions for now. Trying a couple of different points at which to eliminate the outliers, I decided to winsorize at the 4000th observation on both ends.

NB.* winsorizeAt: winsorize top and bottom at xth highest/lowest observation.
winsorizeAt=: 4 : 0
grd=. /:y
y=. (y<:y{~x{grd) }y,:y{~x{grd
y=. (y>:y{~(-x){grd) }y,:y{~(-x){grd
)
ss2=. 'Returns Distribution - Winsorized at 4000th' plotHistoMulti 4000 winsorizeAt ;rets

(#rr)%~rr+/ . =4000 winsorizeAt rr
0.99461657

This gives a good look at the bulk of the data, over 99% of which is unaffected by winsorizing. You can see the winsorized data appearing in the highest and lowest bins of the histogram, making these a little taller than one would expect given the trend of the number of observations decreasing toward the tails of the distribution.
We can also look at our usual statistics on the winsorized data:

usus NB. The definition of "usus"
3 : 0
if. 0=L. y do. (]`|: @.(1<#$y))(<./,>./,mean (,`,: @.(1<#$y)) stddev) y
else. (<./,>./,mean,stddev)&> y end.
)
usus 4000 winsorizeAt rr
_0.09881698 0.10666229 0.00048500529 0.022861136

Notice that the mean and standard deviation remain very similar to those of the unwinsorized data ("usus" stats at end of observations).

## More Data Munging

Now we'll work on removing parts of the data that give inordinate trouble. For instance, when we look at all the daily (price) returns,we see that we have almost 1.9 million observations. However, we run into trouble as soon as we attempt to look at the return distribution:

This last set of numbers tells us our lowest return is negative infinity, our highest is 87.7%, the mean return is negative infinity and the standard deviation is infinity. The infinities are due to zeros in the price data, so let's see what the stats are if we remove these (negative) infinities from the returns:

These numbers are more reasonable but the low extreme is still doubtful: the lowest return is about -10,000%, and the mean is about -21%, which is possible but seems somewhat unlikely.

Looking at the ten lowest and highest returns tells us that there are some outliers:

We see about 4 low returns in the negative thousands of percent with a quick increase to a flatter (few hundreds) set of negative returns. On the positive side, the change from the extreme is more gradual so these outliers are less different from the body of the observations.

Some more work lets us know that a few of our data series are very short; the shortest are only one or two observations:

Companies at indexes 781 and 847 have only 2 and 1 observations:

These companies are:

The sizes of the ten smallest series are:

If we have to pick a cut-off below which we will discard a series, it looks like there's a natural break around 40. This would eliminate these companies:

The latter four look like they joined the S&P near the end of our date range. The other two must have dropped out soon after the start of our date range (Hilton Hotels) or soon after being added to the index (Monster Worldwide).

So, let's remove these few companies based on how few observations they comprise. We have to adjust three of our global variables:

Now, looking the returns with these removed, we see there are still some suspicious returns of -100%, so let's remove these as well and look at the statistics on the sum of our observations:

We still have 1.48 million observations so none of these eliminations have cost us too many observations.

(compare this (min, max, mean, std dev) to final winsorized stats).

Now the lowest and highest returns are of similar magnitude and the mean is slightly positive.

However, our graph of the distribution of returns is still dominated by the extremes:

Looking at the high and low extremes:

These don't look unusual but they make it hard to see the distribution of the main body of returns. To get a better look at this bulk of the observations, we'll use a technique called "winsorizing". This reins in the extreme values by capping them at some arbitrary value closer to the median of the observations.

So, let's winsorize the observations at both ends to concentrate on the bulk of the distributions for now. Trying a couple of different points at which to eliminate the outliers, I decided to winsorize at the 4000th observation on both ends.

This gives a good look at the bulk of the data, over 99% of which is unaffected by winsorizing. You can see the winsorized data appearing in the highest and lowest bins of the histogram, making these a little taller than one would expect given the trend of the number of observations decreasing toward the tails of the distribution.

We can also look at our usual statistics on the winsorized data:

Notice that the mean and standard deviation remain very similar to those of the unwinsorized data ("usus" stats at end of observations).