Sunday, August 7, 2016

Enron becomes unlikely data source for computer science researchers



Computer science researchers have turned to unlikely sources -- including Enron -- to assemble huge collections of spreadsheets that can be used to study how people use this software. The goal is for the data to facilitate research that makes spreadsheets more useful.

"We study spreadsheets because spreadsheet software is used to track everything from corporate earnings to employee benefits, and even simple errors can cost organizations millions of dollars," says Emerson Murphy-Hill, an assistant professor of computer science at NC State and co-author of two new papers on the work.

However, there are relatively few public collections of spreadsheet data available for research purposes. For example, the collection currently used by most researchers consists of approximately 4,500 spreadsheets.
But researchers are now making two new collections available -- one has 15,000 spreadsheets and the other has more than 249,000.

"In addition, we're publishing a technique that other researchers can use to collect additional spreadsheet data," Murphy-Hill says.

The 15,000-spreadsheet collection consists entirely of spreadsheets collected from internal Enron emails, which were made public after the emails were subpoenaed by prosecutors.

"Our focus is on how users interact with spreadsheets," Murphy-Hill says. "And these spreadsheets really tell us a lot about how users represent and manipulate data."

To collect the second set of spreadsheets, called Fuse, the researchers developed their own technique to identify and extract spreadsheets from an online archive of over 5 billion webpages. Using their technique, the researchers collected 249,376 spreadsheets -- including spreadsheets made as recently as 2014.

"Fuse used cloud infrastructure to search through billions of webpages to find and extract the spreadsheets we write about in this paper," says Titus Barik, a Ph.D. student at NC State, researcher at ABB Corporate Research, and lead author of the paper on Fuse. "Commodity cloud computing is incredibly exciting -- searching these pages would take approximately seven years of continuous computation on a single computer, but the economies of scale with cloud computing allowed us to accomplish this with Fuse in just a few days."

"And the fact that Fuse includes recent spreadsheets is a significant advantage over other spreadsheet collections, because the data is more up to date and reflects changes in Excel and other spreadsheet software," Murphy-Hill says.

"Fuse is also more reproducible than other spreadsheet collections," says Kevin Lubick, a Ph.D. student at NC State and co-author of a paper about Fuse. "Reproducibility is the cornerstone of good scientific research, but many existing spreadsheet collections are difficult to reproduce. Our technique can be used by anyone, and they will get the same results we get. But the results may also include any new spreadsheets made available since the last time the program was run."
