Statistics

During the recent Kalido webinar on data science, I was asked a number of questions about data science, which have since been published as a Kalido Expert View. Here's my take on the first question: Q: In your opinion, what is a data scientist? I find it most useful to explain what a data scientist is in the context of data scientist vs. statistician. I was trained as a statistician and used to introduce myself as one, but the image most people got was that I counted runs in baseball or cricket. That’s obviously not what a data scientist does. Data science is more like the next evolution of statistics. Rather than being a reactive type of process, it’s very consultative and you work with people all across the organization — IT to prepare the data, the business units to figure out what the problem is, operations to actually deliver the results to the organization. A data scientist will start off with a problem the business has and actually go out and look for the data. We don’t start off with a nice, clean data file. Instead, we’re tasked with figuring out what insight exists in a messy data source and finding the results. It’s a forward-looking process. We’re actually using statistical techniques and data science applications to predict the future and to help business users answer specific questions. So we’re shifting from questions around, “what happened in the data?” to “what will happen based on this data analysis? If we change that variable, what will the outcome be?” You can find my take on the other questions, Q: What are the top two or three characteristics you would look for when trying to hire a data scientist? Q: How do you see the data science “method” evolving? at the link below. Kalido Expert View: Three Questions for David Smith VP Marketing, Revolution Analytics
about 4 hours ago
Yet another puzzle whose first part does not require R programming, even though it is a programming question in essence:

Given five real numbers x1,…,x5, what is the minimal number of pairwise comparisons needed to rank them? Given 33 real numbers, what is the minimal number of pairwise comparisons required to find the three largest ones?

I do not see a way out of considering the first question as the quickest possible sorting of a real sample. Using either quicksort or heapsort, I achieve sorting the 5 numbers in exactly 6 comparisons for any order of the initial sample. (Now, there may be an even faster way based on comparing partial sums first… I just do not see how!)

For the second part, let us start from the remark that 32 comparisons are needed to find the largest number, then at most 31 for the second largest, and at most 30 for the third largest (since we can take advantage of the partial ordering resulting from the determination of the largest number). This is poor. If I instead use a heap algorithm, I need O(n log n) comparisons to build this binary tree whose parents are always larger than their children, as in the above example. (I can produce a sort of heap structure, although non-binary, in an average 16×2.5=40 steps and a maximum 16×3=48 steps.) The resulting tree provides the largest number (100 in the above example) and at least the second largest number (36 in the above). To get the third largest number, I first need a comparison between the one-before-last terms of the heap (19 vs. 36 in the above), and one or two extra comparisons (25 vs. 19 and maybe 25 vs. 1 in the above). (This would induce an average of 1.5 extra comparisons and a maximum of 2 extra comparisons, resulting in a total of 41.5 average and 50 maximum comparisons with my sub-optimal heap construction.)

Once again, using comparisons of sums may help in speeding up the process, for instance comparing numbers by groups of 3, but I did not pursue this solution… If instead I try to adapt quicksort to this problem, I can have a dynamic pivot that always keeps at most two terms above it, providing the three numbers as a final result. Here is an R code to check its performance:

quick3=function(x){ comp=0 i=1 lower=upper=NULL pivot=x[1] for (i in 2:length(x)){ if (x[i]1) comp=comp+1} comp=comp+1 if (length(upper)==3){ pivot=min(upper) upper=sort(upper)[-1] }} if (length(upper)

When running this R code on 10? random sequences of 33 terms, I obtained the following statistics on the computing costs

> summary(costs)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  32.00   36.00   38.00   37.89   40.00   49.00

and the associated histogram represented above. Interestingly, the minimum is the number of comparisons needed to produce the maximum!
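As a cross-check on the counts discussed in the puzzle, here is a minimal sketch in R. The first function computes the standard information-theoretic lower bound for sorting, which suggests that six comparisons cannot cover every ordering of five values in the worst case. The second, `top3_count`, is a hypothetical helper written for this illustration (it is not the quick3 function from the post): it keeps a running list of the three largest values and counts every pairwise comparison it makes.

```r
# Lower bound for sorting: k comparisons can separate at most 2^k orderings,
# so ranking n distinct values needs at least ceiling(log2(n!)) comparisons
# in the worst case.
min_comparisons_to_sort <- function(n) ceiling(log2(factorial(n)))
min_comparisons_to_sort(5)  # 7

# Hypothetical helper (an assumption for illustration, not the post's quick3):
# maintain the current top three in decreasing order and count comparisons.
top3_count <- function(x) {
  comp <- 0
  top <- numeric(0)               # current largest values, decreasing order
  for (v in x) {
    pos <- length(top) + 1        # start below the smallest kept value
    while (pos > 1) {
      comp <- comp + 1            # one pairwise comparison: v vs. top[pos - 1]
      if (v > top[pos - 1]) pos <- pos - 1 else break
    }
    top <- append(top, v, after = pos - 1)
    if (length(top) > 3) top <- top[1:3]   # drop the fourth largest
  }
  list(top3 = top, comparisons = comp)
}

# Simulated cost on random samples of 33 values
set.seed(1)
costs <- replicate(1000, top3_count(runif(33))$comparisons)
summary(costs)
```

With this simple scheme the cost ranges from 32 comparisons (every new value loses its first comparison) to 1 + 2 + 3×30 = 93, the same worst case as the naive 32 + 31 + 30 count, so it serves only as a baseline against which smarter constructions can be compared.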
about 5 hours ago
(This article was first published on imDEV » r-bloggers, and kindly contributed to R-bloggers) After being busy the last two weeks teaching and attending academic conferences, I finally found some time to do what I love: programming data visualizations using R. After being interested in Shiny for a while, I finally decided to pull the trigger and build my first Shiny app! I wanted to make a proof-of-concept app containing the following dynamics, which are the basics of any UI design: 1) dynamic UI options 2) a dynamically updated plot based on UI inputs. Here is what I came up with. Check out the app for yourself on your local machine or the R code HERE.

library(shiny)
runGist('5792778')

The app consists of a user interface (UI) for selecting the data, the variable to plot, a grouping factor for colors, and four plotting options: boxplot (above), histogram, density plot and bar graph. As an added bonus, the user can choose to show or hide jittered points in the boxplot visualization. Generally #2 above was well described and easy to implement, but it took a lot of trial and error to figure out how to implement #1. Basically, to generate dynamic UI objects, the UI objects need to be called using the function shiny::uiOutput() in the ui.R file and their arguments set in the server.R file using the function shiny::renderUI(). After getting this to work, everything else fell into place. Having some experience making UIs in VBA (Visual Basic) and gWidgets, I find Shiny a joy to work with once you understand some of its inner workings. One aspect that made the learning experience frustrating was the lack of informative errors coming from Shiny functions. Even with all the R debugging tools, having Shiny constantly tell me that something was not correctly called from a reactive environment, or that the error was in runApp(), did not really help. My advice to anyone learning Shiny is to take a look at the tutorials, and particularly the section on Dynamic UI.
Then pick a small example to reverse engineer. Don’t start off too complicated, or you will have a hard time understanding which sections of code are not working as expected. Finally, here are some screenshots, and keep an eye out for more advanced Shiny apps in the near future. To leave a comment for the author, please follow the link and comment on his blog: imDEV » r-bloggers. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
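The dynamic UI pattern described in the Shiny post, a uiOutput() placeholder in the UI whose content is built by renderUI() on the server, can be sketched roughly as follows. This is a minimal single-file illustration rather than the author's app: the input IDs (`dataset`, `variable`), the output IDs (`variable_ui`, `plot`), and the choice of the built-in iris and mtcars data are all assumptions made for the example.

```r
library(shiny)

# UI: the variable selector is only a placeholder here; the server fills it in.
ui <- fluidPage(
  selectInput("dataset", "Data", choices = c("iris", "mtcars")),
  uiOutput("variable_ui"),
  plotOutput("plot")
)

server <- function(input, output) {
  # Reactive data set chosen by the user (built-in data, for illustration)
  data <- reactive(switch(input$dataset, iris = iris, mtcars = mtcars))

  # Dynamic UI: the choices depend on the columns of the selected data
  output$variable_ui <- renderUI({
    numeric_cols <- names(Filter(is.numeric, data()))
    selectInput("variable", "Variable to plot", choices = numeric_cols)
  })

  # Plot that updates whenever the data set or the variable changes
  output$plot <- renderPlot({
    # Wait until the dynamic input exists and matches the current data
    req(input$variable, input$variable %in% names(data()))
    hist(data()[[input$variable]],
         main = input$variable, xlab = input$variable)
  })
}

app <- shinyApp(ui, server)
# runApp(app)  # uncomment to launch in an interactive session
```

The req() call is there because, right after the data set changes, the old variable name can briefly survive until the selector is rebuilt; stopping quietly in that case avoids one of the uninformative errors the post complains about.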
about 9 hours ago
(This article was first published on Noam Ross - R, and kindly contributed to R-bloggers) Yesterday, I was creating a knitr document based on a script, and was looking for a way to include content from an R help file. The script, which was a teaching document, had a help() command for when the author wanted to refer readers to R documentation. I wanted that text in my final document, though. There’s no standard way to do this in R, but with some help from Stack Overflow and Scott Chamberlain, I figured out I needed some functions hidden in the depths of the tools package. So I wrote this function:

    help_console <- function(topic, format=c("text", "html", "latex", "Rd"),
                             lines=NULL, before=NULL, after=NULL) {
      format=match.arg(format)
      if (!is.character(topic)) topic <- deparse(substitute(topic))
      helpfile = utils:::.getHelpFile(help(topic))
      hs <- capture.output(switch(format,
                                  text=tools:::Rd2txt(helpfile),
                                  html=tools:::Rd2HTML(helpfile),
                                  latex=tools:::Rd2latex(helpfile),
                                  Rd=tools:::prepare_Rd(helpfile)
                                  )
                           )
      if(!is.null(lines)) hs <- hs[lines]
      hs <- c(before, hs, after)
      cat(hs, sep="\n")
      invisible(hs)
    }

help_console prints the help file to the console or lets you assign the help file text to a character vector. Below, I use it to dynamically print the start of the help file for the optim() function as quoted HTML (note that the knitr chunk has the option results='asis'):

    help_console(optim, "html", lines = 1:25, before = "", after = "")

R: General-purpose Optimization

optim    R Documentation

General-purpose Optimization

Description

General-purpose optimization based on Nelder–Mead, quasi-Newton and conjugate-gradient algorithms. It includes an option for box-constrained optimization and simulated annealing.

Usage

    optim(par, fn, gr = NULL, ..., method = c("Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN", "Brent"), lower = -Inf, upper = Inf, control = list(), hessian = FALSE)

The function is part of my noamtools package on GitHub, where I keep various convenience functions. Enjoy, and fork if you have improvements! To leave a comment for the author, please follow the link and comment on his blog: Noam Ross - R.
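For comparison with help_console, the same kind of text can be obtained without the internal utils:::.getHelpFile, because tools::Rd_db() and tools::Rd2txt() are both exported. The following is a rough sketch, not part of the noamtools package; it assumes the stats package is installed with its help database, and the object names (`db`, `rd`, `txt`) are arbitrary.

```r
# Load the parsed Rd help pages of an installed package (exported function)
db <- tools::Rd_db("stats")

# Entries are named after their Rd files; pick the page for optim()
rd <- db[["optim.Rd"]]

# Rd2txt() writes to the console by default, so capture its output;
# underline_titles = FALSE avoids backspace overstriking in section titles
txt <- capture.output(
  tools::Rd2txt(rd, options = list(underline_titles = FALSE))
)

# First lines of the rendered help page, as a character vector
head(txt, 10)
```

Unlike help_console, this sketch needs to know which package the topic lives in, which is the price of avoiding the internal help lookup.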
about 9 hours ago
(This article was first published on Tatvic Blog » R, and kindly contributed to R-bloggers) As we believe you may know, we are having a webinar tomorrow (June 19th, 2013) on Predictive Analytics. During this webinar, you are going to be introduced to R, learn how to build a predictive model, and learn how to carry out insightful analysis through visualization.

As learning a new language can be a really difficult and painful process, we thought that it would be a valuable idea to share useful links for R resources with you. If you can spare some time to read some of these links, we believe that this first briefing will enable you to come to our webinar with a better background.

So, what do you say? Are you in for some reading and for reducing your learning curve?

Downloads
R : http://www.r-project.org/ (choose your nearest download location and click on the appropriate link)
RStudio : http://www.rstudio.com/

Packages
RGoogleAnalytics : https://code.google.com/p/r-google-analytics/
Guide to getting started with RGoogleAnalytics : http://bit.ly/11kUgzI
Guide to getting started with ggplot2 : http://www.cookbook-r.com/Graphs/
Ggplot2 chart chooser : http://www.yaksis.com/posts/r-chart-chooser.html
Finding additional R packages for your domain : http://cran.r-project.org/web/views/
Additional ideas for predictive modelling : http://bit.ly/13XyCCK

Courses on R
Codeschool : http://tryr.codeschool.com/
2 minute short videos on R : http://www.twotorials.com/

Community
A Prezi tour of the R ecosystem : http://prezi.com/s1qrgfm9ko4i/the-r-ecosystem/
R news and tutorials from prominent R blogs : http://www.r-bloggers.com/
A search engine for R : http://www.rseek.org/

If you come across more resources, please drop a comment below.

Kushan Shah
Kushan is a Web Analyst at Tatvic. His interests lie in getting the maximum insights out of raw data using R and Python.

To leave a comment for the author, please follow the link and comment on his blog: Tatvic Blog » R.
about 10 hours ago
(This article was first published on Gianluca Baio's blog, and kindly contributed to R-bloggers) After months of work (although to be fair, we haven't worked 100% full time on this), Andrea and I are nearly ready to publish the next release of BCEA. Andrea has done a brilliant job and is responsible for most of the good new features (NB: see what I'm doing here? Subtly putting the blame on him if things go tits up, but also appearing like a magnanimous supervisor who's only pretending not to deserve full credit, if they don't...). But, seriously, I really mean that he's been brilliant, especially since he had to put up with my being extremely picky on details like font size and similar! Anyway, we're really excited (well, at least as excited as you can be about a computer package) about the new features, which basically are of three types.

The first one is in terms of the graphical capabilities of the package. We have implemented all the graphical functions in ggplot2, which now complements the base graphical engine.

The second one is that we have included the possibility of running multiple health economic comparisons in a single evaluation. In the current version of BCEA, you are allowed to have many interventions, but the comparisons are performed pairwise against one of them, which the user defines as the "reference" intervention. Now, it will be possible to produce an analysis of all the interventions jointly. This has clear links with multiple treatment comparisons (as pointed out in chapter 9 here).

The third new feature allows the user to compute the expected value of partial perfect information (EVPPI) with respect to one of the parameters included in the model. This is a very important aspect of the process of probabilistic sensitivity analysis and is normally performed using a two-stage MCMC process (which is explained in chapters 3 and 4 of BMHE). But this can be (and nearly always is) a very computationally intensive process. Also, you can't use too few iterations in either of the two MCMC stages, because that has a crucial impact on the precision of the results. And it is difficult to standardise the analysis using the two-stage MCMC approach, because it depends very much on the model being fitted. However, Mohsen Sadatsafavi and colleagues have recently published a paper in which they found a clever way of approximating the EVPPI once the original model has been run (i.e. with a single MCMC step, which you would do anyway). I wasn't aware of the paper, but after Mohsen contacted me and pointed it out, I decided we should implement it in BCEA.

I'm not completely sold on the ggplot2 thing. I think it can be very good and gives you a lot of freedom and flexibility. But sometimes it feels like overkill, really. Still, it will be helpful, for example, in problems with multiple interventions, where it is more important that the user can customise the resulting graphs, given that they can be very cluttered if there are many interventions being compared at the same time (at the moment we allow a maximum of 6).

In the next couple of days we'll release the new version as some sort of beta test. We have done some tests ourselves, of course, and everything seems to work OK. But of course it would be good if we could get more feedback on different problems. To leave a comment for the author, please follow the link and comment on his blog: Gianluca Baio's blog.
about 11 hours ago
When you go to a conference, there are typically several talks going on at the same time, and you can always tell there's a popular paper coming up when you see people leave a bunch of rooms at once and head straight into one. There's also the unfortunate case when someone speaks, and there's only a handful of people in the room, all in the back staring at their laptops. Open Data City visualized this activity during the German internet conference re:publica. Open Data City used MAC addresses and access point connections to keep track of where devices went. So a person might be in a room connected to the nearest access point, disconnect as he leaves, and then reconnect as he enters another room, which provides the flow. It's fun to watch the conference play out even if you didn't attend. Each dot represents an attendee, and as the animation plays the dots migrate from room to room. Click and drag over the dots to select specific people. [Thanks, Michael]
about 11 hours ago
Read Patrick Butler's analysis of the numbers here. See the interactive child poverty map of the UK. Poverty, Office for National Statistics, Family, Children, Welfare. Mona Chalabi. guardian.co.uk © 2013 Guardian News and Media Limited or its affiliated companies. All rights reserved.
about 12 hours ago
Our feelings of financial security (or lack thereof) reveal much more than how gloomy we are - they're a key economic indicator. Now, data from NatCen shows big differences between young and old people's perceptions about their personal finances. Read commentary from Paul Johnson here.

Last week saw the release of statistics on households below average income - in other words, poverty data. This provides critical information about who is the most economically vulnerable, but it fails to capture who feels the most economically vulnerable in British society. Those feelings don't just affect mental health and voting patterns, they also have concrete economic consequences. When people feel financially secure, they spend more, and that can help boost the economy. Consumer confidence is therefore something that economists and policy makers watch closely.

It's reciprocal of course - when economic performance is bad, consumer confidence is low - but you can separate out the two concepts. Not everyone can accurately judge the state of the economy, so their personal feelings might not mirror national performance. So how has economic news affected us lately? NatCen has drawn together data from various surveys that ask UK households how they feel about their finances. Unsurprisingly, since 2007, a greater proportion of people report they are 'finding it difficult' or 'finding it very difficult' when asked how they are managing financially.

It's also perhaps not surprising that those responses have risen the fastest for those in the 'unemployed' and 'long-term sick and disabled' categories. What does however stand out from these numbers is just how insulated pensioners appear to feel from the financial crisis that began in 2007. In 2007, 41% of pensioners said they were 'living comfortably' compared to 28% of families with children, 27% of 16-24 year olds and just 14% of the unemployed offering the same response. By 2011, 39% of pensioners said the same - a fall of just 2 percentage points since the start of the crisis. But for other groups, the decline has been far more dramatic - 11% of those with a long-term sickness or disability said they were living comfortably in 2007 but just 5% said the same in 2011. Dr Matt Barnes, Research Director of NatCen, who pulled together the research, said: "The evidence suggests that the recession has had a larger impact on some groups rather than others."

Optimism

Despite the fact that they appear more insulated, pensioners appear to be the most pessimistic group about their future finances. When asked how they feel their financial situation might differ a year from now, pensioners are the most likely group to respond 'worse off'.

Perhaps those fears, together with the fact that pensioners represent key voters, mean that austerity measures targeted at them are often dubbed 'political suicide'. It should also be noted that pensioners often experience other types of vulnerability and though as a group they may not feel so deeply affected by the crisis, there are still 1.8 million of them living in poverty according to Age UK.

Poverty, Financial crisis, Family finances. Mona Chalabi
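The gap between the two groups quoted in the NatCen piece is easier to see when the change is expressed both in percentage points and relative to the 2007 level. The short R sketch below uses only the four 'living comfortably' figures quoted in the article; the data frame and column names are made up for the illustration.

```r
# Share saying they were 'living comfortably', as quoted in the article (%)
living_comfortably <- data.frame(
  group = c("Pensioners", "Long-term sick or disabled"),
  y2007 = c(41, 11),
  y2011 = c(39, 5)
)

# Absolute change in percentage points
living_comfortably$point_change <-
  living_comfortably$y2011 - living_comfortably$y2007

# Change relative to the 2007 share, in percent
living_comfortably$relative_change <-
  round(100 * living_comfortably$point_change / living_comfortably$y2007, 1)

living_comfortably
```

On these figures the pensioners' share fell by 2 points (about 5% of its 2007 level), while the share among the long-term sick or disabled fell by 6 points, more than half of its 2007 level.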
about 12 hours ago
Cartographer Andy Woodruff has made a stylised map of bus speeds in Boston, MA, using GPS data from the Massachusetts Bay Transportation Authority and NextBus. John Burn-Murdoch
about 12 hours ago