More Data Munging

Now we'll work on removing parts of the data that give inordinate trouble. For instance, when we look at all the daily (price) returns,
 $rets=. ret1p&.>3{"1&.>CODAT
848
 $;rets
1881993
we see that we have almost 1.9 million observations. However, we run into trouble as soon as we attempt to look at the return distribution:
 load'mystats'
 ss=: 'Returns Distribution' plotHistoMulti ;rets
|domain error: to
| x.+(1{y.)*(*x.-~0{y.)* i.>:<.0.5+(|x.-0{y.)%1{y.
 usus ;rets
__ 0.87735849 __ _
This last set of numbers tells us our lowest return is negative infinity, our highest is 87.7%, the mean return is negative infinity and the standard deviation is infinity. The infinities are due to zeros in the price data, so let's see what the stats are if we remove these (negative) infinities from the returns:
 replInf0=: 3 : '(y.=__)}y.,:0'
 rets=. replInf0&.>rets
 usus ;rets
_101.6 0.87735849 _0.21019811 0.42054006
These numbers are more reasonable but the low extreme is still doubtful: the lowest return is about -10,000%, and the mean is about -21%, which is possible but seems somewhat unlikely.
Looking at the ten lowest and highest returns tells us that there are some outliers:
 (;rets){~10{./:;rets
_101.6 _84.94 _28.25 _26 _4.3333333 _3.7777778 _1.8849558 _1.8333333 _1.8064516 _1.729927
 (;rets){~10{.\:;rets
0.87735849 0.78333333 0.76470588 0.75324675 0.62962963 0.56045752 0.54189944 0.52631579 0.51775148 0.46969697
We see about 4 low returns in the negative thousands of percent with a quick increase to a flatter (few hundreds) set of negative returns. On the positive side, the change from the extreme is more gradual so these outliers are less different from the body of the observations.
Some more work lets us know that a few of our data series are very short; the shortest are only one or two observations:
 pxs=. _1-.~&.>3{"1&.>CODAT
 <./#&>pxs
1
 I. 1=#&>pxs
781 847
Companies at indexes 781 and 847 have only 2 and 1 observations:
 $&.>781 847 { CODAT
+---+---+
|2 6|1 6|
+---+---+
 
These companies are:
 781 847 { COINFO
+----+--------------------------------+--------+------+----------+
|MNST|Monster Worldwide Inc. |61174210|064156| 6/04/2001|
+----+--------------------------------+--------+------+----------+
|LLL |L-3 Communications Holdings Inc.|50242410|110685|12/01/2004|
+----+--------------------------------+--------+------+----------+

The sizes of the ten smallest series are:
 (#&>CODAT){~10{./:#&>CODAT NB. 10 smallest data series
1 2 3 24 27 28 44 46 66 87
If we have to pick a cut-off below which we will discard a series, it looks like there's a natural break around 40. This would eliminate these companies:
 I. 40>#&>pxs
220 781 844 845 846 847
 COINFO{~I. 40>#&>pxs
+----+------------------------------------+--------+------+----------+
|HLT |Hilton Hotels Corp. |43284810|005643| 1/02/1990|
+----+------------------------------------+--------+------+----------+
|MNST|Monster Worldwide Inc. |61174210|064156| 6/04/2001|
+----+------------------------------------+--------+------+----------+
|COH |Coach Inc. |18975410|140541| 9/01/2004|
+----+------------------------------------+--------+------+----------+
|CIT |CIT Group Inc. |12558110|149738|10/27/2004|
+----+------------------------------------+--------+------+----------+
|LH |Laboratory Corp. of America Holdings|50540R40|014960|11/01/2004|
+----+------------------------------------+--------+------+----------+
|LLL |L-3 Communications Holdings Inc. |50242410|110685|12/01/2004|
+----+------------------------------------+--------+------+----------+
The latter four look like they joined the S&P near the end of our date range. The other two must have dropped out soon after the start of our date range (Hilton Hotels) or soon after being added to the index (Monster Worldwide).
So, let's remove these few companies based on how few observations they comprise. We have to adjust three of our global variables:
 exclsz=. 40<#&>CODAT NB. Eliminate short series
 'COIX CODAT'=: (<exclsz)#&.>COIX;<CODAT
 COINFO=: (1,exclsz)#COINFO
Now, looking the returns with these removed, we see there are still some suspicious returns of -100%, so let's remove these as well and look at the statistics on the sum of our observations:
 rets=. ret1p&.>_1-.~&.>3{"1&.>CODAT
 $;rets
1881564
 usus ;rets
_1 0.87735849 _0.21003814 0.40836759
 $_1-.~;rets
1485670
We still have 1.48 million observations so none of these eliminations have cost us too many observations.
 rets=. _1-.~&.>ret1p&.>_1-.~&.>3{"1&.>CODAT
 usus ;rets
_0.85158151 0.87735849 0.00046699481 0.024474664
(compare this (min, max, mean, std dev) to final winsorized stats).
Now the lowest and highest returns are of similar magnitude and the mean is slightly positive.

However, our graph of the distribution of returns is still dominated by the extremes:
 ss2=. 'Returns Distribution After Initial Eliminations' plotHistoMulti ;rets
retAfterInitialEliminations.png

Looking at the high and low extremes:
 5 10$0.001 roundNums rr{~50{./:rr NB. 50 lowest returns
_0.852 _0.851 _0.766 _0.761 _0.636 _0.623 _0.614 _0.61 _0.597 _0.585
_0.573 _0.537 _0.528 _0.519 _0.496 _0.495 _0.495 _0.494 _0.493 _0.49
 _0.48 _0.48 _0.475 _0.471 _0.467 _0.465 _0.46 _0.455 _0.446 _0.445
_0.445 _0.439 _0.438 _0.436 _0.425 _0.424 _0.424 _0.422 _0.422 _0.418
_0.415 _0.41 _0.409 _0.408 _0.405 _0.401 _0.4 _0.4 _0.4 _0.398
 
 5 10$0.001 roundNums rr{~50{.\:rr NB. 50 highest returns
0.877 0.783 0.765 0.753 0.677 0.63 0.56 0.542 0.526 0.518
 0.47 0.458 0.451 0.45 0.438 0.427 0.424 0.42 0.41 0.409
0.405 0.394 0.391 0.389 0.384 0.384 0.379 0.369 0.368 0.367
0.366 0.363 0.36 0.358 0.355 0.353 0.351 0.348 0.348 0.348
0.347 0.339 0.338 0.333 0.332 0.331 0.331 0.33 0.33 0.326
These don't look unusual but they make it hard to see the distribution of the main body of returns. To get a better look at this bulk of the observations, we'll use a technique called "winsorizing". This reins in the extreme values by capping them at some arbitrary value closer to the median of the observations.

So, let's winsorize the observations at both ends to concentrate on the bulk of the distributions for now. Trying a couple of different points at which to eliminate the outliers, I decided to winsorize at the 4000th observation on both ends.
NB.* winsorizeAt: winsorize top and bottom at xth highest/lowest observation.
winsorizeAt=: 4 : 0
 grd=. /:y
 y=. (y<:y{~x{grd) }y,:y{~x{grd
 y=. (y>:y{~(-x){grd) }y,:y{~(-x){grd
)
 ss2=. 'Returns Distribution - Winsorized at 4000th' plotHistoMulti 4000 winsorizeAt ;rets
retDistWins4000th.png

   (#rr)%~rr+/ . =4000 winsorizeAt rr
0.99461657
This gives a good look at the bulk of the data, over 99% of which is unaffected by winsorizing. You can see the winsorized data appearing in the highest and lowest bins of the histogram, making these a little taller than one would expect given the trend of the number of observations decreasing toward the tails of the distribution.
We can also look at our usual statistics on the winsorized data:
   usus           NB.  The definition of "usus"
3 : 0
   if. 0=L. y do. (]`|: @.(1<#$y))(<./,>./,mean (,`,: @.(1<#$y)) stddev) y
   else. (<./,>./,mean,stddev)&> y end.
)
 
   usus 4000 winsorizeAt rr
_0.09881698 0.10666229 0.00048500529 0.022861136
Notice that the mean and standard deviation remain very similar to those of the unwinsorized data ("usus" stats at end of observations).