Statistics

(This article was first published on TRinker's R Blog » R, and kindly contributed to R-bloggers) I started working with R 2 1/2 years ago. I remember opening R, closing it, and thinking it was the dumbest thing ever (a command line is not inviting to a non-programmer). Now it’s my constant friend. From the beginning I took notes to remind myself of all the things I learned and relearned. They’ve been invaluable to me in learning. There are likely bad or outdated practices in there, but I figured they may be helpful to others learning the language, so I’m sharing them. Note that: 1) they are poorly arranged, 2) they may contain mistakes, 3) they don’t credit others’ work properly or at all. They were written for me, but now I think others may find them useful, so here they are: click here. Note that the file is large: ~7000 KB and 274 pages. To leave a comment for the author, please follow the link and comment on his blog: TRinker's R Blog » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
score: 1 about 6 hours ago
One of the movies I watched during my hospitalisation is Detachment, by Tony Kaye, with Adrien Brody as the lead actor. My daughter brought it to me as she remembered I was interested in it. Detachment is a strong and highly original movie about the U.S. school system and the complete lack of prospects for students in deprived suburbs. I have seen several movies of that kind in the past, some of them rather good and steering clear of the fairy tale that an exceptional teacher is enough to rescue a class cohort, or even a single student, from a bleak future. This one is however the most pessimistic of all, with no happy ending of any sort (except for the last minute, which should have been cut). The plot is not flawless, e.g. the main teacher's redemption of the young prostitute is just too unrealistic, but the burnout of the teachers, the newspeak preaching of the administration, the nihilism of the high school students, the bullying of unusual students, and the complete absence of the parents (unless I am confused, we only see one [screaming] mother once, no parent shows up at parents’ night, and the bullying father is only a voice…) make up for those flaws. Adrien Brody delivers a superb performance in a great movie, sadly about a terrible issue with our educational system(s)… Filed under: Books, Kids Tagged: Adrian Brody, detachment, high school, movie review
score: 1 about 9 hours ago
(This article was first published on Econometrics_Help, and kindly contributed to R-bloggers) Every month I see one or more new R-based web server solutions coming onto the market. After looking at some of them, I thought I would share one of my old architecture maps, presented to a client back in early 2009 (it is good to see a scalable and customizable open source statistical computing tool spreading so quickly in the market).
score: 1 about 15 hours ago
(This article was first published on bayesianbiologist » Rstats, and kindly contributed to R-bloggers) I am currently working on a validation metric for binary prediction models, that is, models which make predictions about outcomes that can take on either of two possible states (e.g. dead/not dead, heads/tails, cat in picture/no cat in picture, etc.). The most commonly used metric for this class of models is AUC, which assesses the relative error rates (false positive, false negative) across the whole range of possible decision thresholds. The result is a curve, the Receiver Operating Characteristic (ROC) curve, and the area under it is some value between 0 and 1. The higher this value, the better your model is said to perform. The problem with this metric, as many authors have pointed out, is that a model can perform very well in terms of AUC but be completely miscalibrated in terms of the actual probabilities placed on each outcome. A model which distinguishes perfectly between positive and negative cases (AUC=1) by placing a probability of 0.01 on positive cases and 0.001 on negative cases may be very far off in terms of the actual probability of a positive case. For instance, positive cases may actually occur with probability 0.6 and negative cases with 0.2. In most real situations, our models will predict a whole range of different probabilities, with a unique prediction for each data point, but the general idea remains. If your goal is simply to distinguish between cases, you may not care whether the probabilities are correct. However, if your model purports to quantify risk, then you very much want to know whether you are placing probabilistically true predictions on cases that are yet to be observed. Which raises the question: what is probabilistic truth? This question appears, at least at first, to be rather simple.
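The AUC example in this post (perfect separation with predicted probabilities of 0.01 and 0.001) can be checked numerically. The sketch below is an illustration only, not the author's code: the helper name auc_rank and the five-case toy data are assumptions, and the AUC is computed with the rank (Mann-Whitney) formulation in base R.

```r
# AUC via the rank (Mann-Whitney) formulation: the probability that a
# randomly chosen positive case gets a higher score than a negative one.
# auc_rank is a hypothetical helper name, not part of any package.
auc_rank <- function(score, label) {
  r <- rank(score)                      # average ranks, so ties are handled
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data (made up): three positive and two negative cases.
label <- c(1, 1, 1, 0, 0)

# Predictions as in the post: 0.01 for positives, 0.001 for negatives.
pred <- ifelse(label == 1, 0.01, 0.001)

auc_value <- auc_rank(pred, label)            # discrimination is perfect
auc_rescaled <- auc_rank(50 * pred, label)    # rescaling leaves AUC unchanged

# Calibration is another matter: the mean predicted probability is far
# below the observed frequency of positive cases.
c(auc = auc_value, mean_predicted = mean(pred), observed = mean(label))
```

Because AUC depends only on the ordering of the scores, any monotone rescaling of the predictions gives the same value, which is why it cannot detect this kind of miscalibration.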
A frequentist definition would say that the probability is correct, or true, if the predicted probability is equal to the long-run frequency of outcomes. Think of a die rolled over and over, counting the number of times a one is rolled. We would compare this frequency to our predicted probability of rolling a one (1/6 for a fair six-sided die) and would say that our predicted probability was true if this frequency matched 1/6. But what about situations where we can’t re-run an experiment over and over again? How then would we evaluate the probabilistic truth of our predictions? I’ll be working through this problem in a series of posts in the coming weeks. Stay tuned!
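The frequentist check described above is easy to simulate. The sketch below rolls a fair six-sided die many times with base R's sample() and compares the observed frequency of ones with the predicted 1/6; the number of rolls and the seed are arbitrary choices made for illustration.

```r
set.seed(42)                        # arbitrary seed, for reproducibility only
n_rolls <- 60000                    # arbitrary number of rolls

# Simulate the rolls of a fair six-sided die.
rolls <- sample(1:6, size = n_rolls, replace = TRUE)

# Long-run frequency of rolling a one.
freq_one <- mean(rolls == 1)

# Running frequency, to see how it settles as the number of rolls grows.
running <- cumsum(rolls == 1) / seq_len(n_rolls)

c(observed = freq_one, predicted = 1 / 6)
running[c(10, 100, 1000, n_rolls)]
```

With few rolls the running frequency can sit well away from 1/6; the agreement only emerges in the long run, which is exactly what is unavailable when an experiment cannot be repeated.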
score: 1 about 15 hours ago
(This article was first published on Blog - Applied Predictive Modeling, and kindly contributed to R-bloggers) Here is a summary of some recent changes to caret.
Feature Updates:
- train was updated to utilize recent changes in the gbm package that allow for boosting with three or more classes (via the multinomial distribution).
- The Yeo-Johnson power transformation was added. This is very similar to the Box-Cox transformation, but it does not require the data to be greater than zero.
New models referenced by train:
- Maximum uncertainty linear discriminant analysis (Mlda) and factor-based linear discriminant analysis (RFlda) from the HiDimDA package were added.
- The kknn.train model in the kknn package was added. This is basically a more intelligent K-nearest neighbors model that can use distance weighting, non-Euclidean distances (via the Minkowski distance) and a few other features.
- The extraTrees function in the package of the same name was added. This generalizes the random forest model by adding randomness to the predictors and the split values that are evaluated at each split point.
Numerous bugs were also fixed in the last few releases. The new version is 5.16-04. Feel free to email me at mxkuhn@gmail.com if you have any feature requests or questions.
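For reference, the Yeo-Johnson transformation mentioned above can be written in a few lines of base R. This is a sketch of the standard formula, not caret's implementation; the function name yeo_johnson and the tolerance used to detect the special cases lambda = 0 and lambda = 2 are assumptions.

```r
# Yeo-Johnson power transformation for a numeric vector y and a single
# lambda. Unlike Box-Cox, it is defined for zero and negative values.
# yeo_johnson is a hypothetical helper name, not a caret function.
yeo_johnson <- function(y, lambda, tol = 1e-8) {
  out <- numeric(length(y))
  nonneg <- y >= 0

  # Non-negative values: Box-Cox applied to (y + 1).
  if (abs(lambda) > tol) {
    out[nonneg] <- ((y[nonneg] + 1)^lambda - 1) / lambda
  } else {
    out[nonneg] <- log1p(y[nonneg])
  }

  # Negative values: mirrored Box-Cox with power (2 - lambda).
  if (abs(lambda - 2) > tol) {
    out[!nonneg] <- -((1 - y[!nonneg])^(2 - lambda) - 1) / (2 - lambda)
  } else {
    out[!nonneg] <- -log1p(-y[!nonneg])
  }
  out
}

x <- c(-2, -0.5, 0, 1, 3)
yj_identity <- yeo_johnson(x, lambda = 1)   # lambda = 1 leaves the data unchanged
yj_log <- yeo_johnson(x, lambda = 0)        # log(y + 1) on the non-negative part
```

In practice lambda is estimated from the data rather than fixed by hand; the fixed values here only illustrate the two branches of the formula.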
score: 1 about 19 hours ago
Filed under: Kids, pictures, Travel Tagged: E15 highway, graffitis, Paris suburbs, tags
score: 1 about 20 hours ago
(This article was first published on Wiekvoet, and kindly contributed to R-bloggers) I was reading Paul Hiemstra's blog post "Much more efficient bubble sort in R using the Rcpp and inline packages", went back to his first post, "Bubble sort implemented in pure R", and thought: surely we can do better in pure R. So I cleaned up the inner loops, removed a recursion, made some variations, and finally got a big improvement by vectorizing. I don't need to try Rcpp to know that it is and will remain faster, but I surely closed the gap a bit.

Original code
This is how it was originally in the blog, in pure R. If there are changes, they are in the layout.

    larger = function(pair) {
      if(pair[1] > pair[2]) return(TRUE) else return(FALSE)
    }
    swap_if_larger = function(pair) {
      if(larger(pair)) {
        return(rev(pair))
      } else {
        return(pair)
      }
    }
    swap_pass = function(vec) {
      for(i in seq(1, length(vec)-1)) {
        vec[i:(i+1)] = swap_if_larger(vec[i:(i+1)])
      }
      return(vec)
    }
    bubble_sort = function(vec) {
      new_vec = swap_pass(vec)
      if(isTRUE(all.equal(vec, new_vec))) {
        return(new_vec)
      } else {
        return(bubble_sort(new_vec))
      }
    }

Improve inside of loops
This is essentially the same code, but with the first two functions cleaned up.

    larger1 = function(pair) pair[1] > pair[2]
    swap_if_larger1 = function(pair) {
      if(larger1(pair)) rev(pair) else pair
    }
    swap_pass1 = function(vec) {
      for(i in seq(1, length(vec)-1)) {
        vec[i:(i+1)] = swap_if_larger1(vec[i:(i+1)])
      }
      return(vec)
    }
    bubble_sort1 = function(vec) {
      new_vec = swap_pass1(vec)
      if(isTRUE(all.equal(vec, new_vec))) {
        return(new_vec)
      } else {
        return(bubble_sort1(new_vec))
      }
    }

Improve outside loops
I cleaned up the outside loop, then decided that the two inside functions were not needed any more, because the work was reduced to one statement.

    swap_pass2 = function(vec) {
      for(i in 1:(length(vec)-1)) {
        # swap neighbours that are out of order
        if (vec[i] > vec[i+1]) vec[c(i,i+1)] = vec[c(i+1,i)]
      }
      return(vec)
    }
    bubble_sort2 = function(vec) {
      new_vec = swap_pass2(vec)
      if(identical(vec, new_vec)) new_vec else bubble_sort2(new_vec)
    }

No recursion
This was actually not intended as a speed update, but R was complaining about too deep recursion. The same swap_pass2 as before is used.

    bubble_sort3 = function(vec) {
      new_vec = swap_pass2(vec)
      while (!identical(vec, new_vec)) {
        vec <- new_vec
        new_vec = swap_pass2(vec)
      }
      new_vec
    }

A different setup
The way I understood bubble sort is different from what is programmed so far. So far the sort is completed when no improvements are made, which is checked via a vector comparison. I always understood that you don't check: the first bubble goes to the end, at which point the last element is the maximum. The second bubble then stops at end-1, etc. The final bubble covers only the first two elements, after which the algorithm is finished.

    swap_pass4 = function(vec,iend) {
      for(i in 1:iend) {
        if (vec[i] > vec[i+1]) vec[c(i,i+1)] = vec[c(i+1,i)]
      }
      return(vec)
    }
    bubble_sort4 = function(vec) {
      for (iend in (length(vec)-1):1) vec <- swap_pass4(vec,iend)
      vec
    }

Tuning Paul's original algorithm
I was a bit disappointed with the improvements from using the true bubble sort. Apparently there is some gain from checking whether the process is completed, which is logical since the bubbles do some sorting of the intermediate elements. So, what about bubbles that go up and go down, combined with a check on no improvements?

    swap_pass3b = function(vec) {
      for(i in length(vec):2) {
        # downward pass: move small elements towards the front
        if (vec[i] < vec[i-1]) vec[c(i,i-1)] = vec[c(i-1,i)]
      }
      return(vec)
    }
    bubble_sort3b = function(vec) {
      new_vec = swap_pass2(vec)
      while (!identical(vec, new_vec)) {
        new_vec = swap_pass2(vec)
        vec <- swap_pass3b(new_vec)
      }
      vec
    }

Vectorizing the bubbles
When you look at the bubble sort without checking, the only thing it relies on is that the first step brings the highest element to the last position. This can be achieved by just pulling out the highest element and placing it at the end, without the intermediate swaps.

    bubble_sort5 = function(vec) {
      wm [...]
      vec [...]
      for (iend in ((length(vec)-1):2)) {
        wm [...]
        vec [...]
      }
      vec
    }

Results
First a demonstration that they work; test_vec = r
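The idea described under "Vectorizing the bubbles" can be sketched as follows. This is not the author's bubble_sort5, whose assignments did not survive in this copy; it is a minimal illustration under stated assumptions, with hypothetical function names. Pulling the maximum to the end of a shrinking prefix is, strictly speaking, a selection sort, shown here next to a loop-based bubble sort with an early exit for comparison.

```r
# Loop-based bubble sort: each pass bubbles the largest remaining element
# to position iend + 1 and stops early when a pass makes no swap.
bubble_sort_loop <- function(vec) {
  n <- length(vec)
  if (n < 2) return(vec)
  for (iend in (n - 1):1) {
    swapped <- FALSE
    for (i in 1:iend) {
      if (vec[i] > vec[i + 1]) {
        vec[c(i, i + 1)] <- vec[c(i + 1, i)]
        swapped <- TRUE
      }
    }
    if (!swapped) break
  }
  vec
}

# Vectorised variant: instead of swapping neighbours, locate the maximum
# of the unsorted prefix with which.max() and move it to the prefix end.
max_to_end_sort <- function(vec) {
  n <- length(vec)
  if (n < 2) return(vec)
  for (iend in n:2) {
    head_part <- vec[1:iend]
    wm <- which.max(head_part)
    vec[1:iend] <- c(head_part[-wm], head_part[wm])
  }
  vec
}

set.seed(1)            # arbitrary seed
x <- sample(200)       # a random permutation of 1..200
sorted_loop <- bubble_sort_loop(x)
sorted_vec <- max_to_end_sort(x)
```

The inner loop of the first function runs in interpreted R, while which.max() and the subsetting in the second run in compiled code, which is where a speed-up of the kind the post reports would be expected to come from.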
score: 1 about 23 hours ago
Kenneth Feinberg explains his approach to deciding how to distribute funds for victims of tragedies such as the Boston Marathon bombings.
score: 1 1 day ago
(This article was first published on lukemiller.org » R-project, and kindly contributed to R-bloggers) XTide is an open-source program that predicts tide heights and current speeds for hundreds of tide and current stations around the United States. It can be used to produce tide predictions in the past and future for a site at your chosen interval (down to the minute), as well as producing sunrise and sunset times, [...]
score: 1 1 day ago
(This article was first published on factbased, and kindly contributed to R-bloggers) Last weekend I submitted an update of my R package datamart to CRAN. It has been more than half a year since the last update, but there are only minor advances. The package is still in its early stages, and very experimental.

One new feature is the function uconv. Think iconv, but instead of converting character vectors between different encodings, this function converts numerical vectors between different units of measurement. Now if you want to know how many centimeters one horse length is, you can write in R:

    > #install.packages("datamart")
    > library(datamart)
    > uconv(1, "horse length", "cm")

and you will get the answer 240. I had the idea for this function when I had to convert between various energy units, including natural units of energy fuels like cubic metres of natural gas. The uconv function supports this, using common constants for the conversion.

    > uconv(1, "Mtoe", "PJ")
    [1] 41.88
    > uconv(1, "m³ NG", "kWh")
    [1] 10.55556

These conversions may be ambiguous. For instance, the last one combines a volume and an energy dimension. An optional parameter allows the specification of the context, or unit set:

    > uconv(1, "Mtoe", "PJ", uset="Energy")

The currently available unit sets and the units therein can be inspected with

    > uconvlist()

The first argument can be a numerical vector:

    > set.seed(13)
    > uconv(37+2*rnorm(5), "°C", "°F", uset="Temperature")
    [1] 100.59558 97.59102 104.99059 99.27435 102.71309
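To make the idea of a unit set concrete, here is a minimal base R sketch of a table-driven converter. It does not show how datamart implements uconv; the names unit_sets and convert_units are hypothetical, and the only conversion constants used are the two quoted in the post (240 cm per horse length, 41.88 PJ per Mtoe) plus the definition of a metre.

```r
# Each unit set maps unit names to a factor relative to one base unit.
# unit_sets and convert_units are hypothetical names for illustration.
unit_sets <- list(
  Length = c("cm" = 1, "m" = 100, "horse length" = 240),  # base unit: cm
  Energy = c("PJ" = 1, "Mtoe" = 41.88)                    # base unit: PJ
)

convert_units <- function(x, from, to, uset) {
  factors <- unit_sets[[uset]]
  if (is.null(factors)) stop("unknown unit set: ", uset)
  if (!all(c(from, to) %in% names(factors))) {
    stop("unknown unit in unit set ", uset)
  }
  # Convert to the base unit first, then to the target unit.
  x * factors[[from]] / factors[[to]]
}

horse_cm <- convert_units(1, "horse length", "cm", uset = "Length")
energy_pj <- convert_units(c(1, 2), "Mtoe", "PJ", uset = "Energy")
```

A purely multiplicative table like this cannot express conversions with an offset, such as °C to °F, so a real implementation needs something more general for those units.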
score: 1 1 day ago