CSI 991 Section 001
Contact:
jgentle@gmu.edu
The seminar will be conducted in the form of a journal club, in which recent research articles are summarized.
The seminar may also include discussions, tutorials, and demonstrations of current computational methodology.
The objective of the journal club is three-fold:
The objective of tutorials and demonstrations of computational methodology is to introduce seminar participants to current developments and practices that may not be covered in regular CSI courses. Possible topics include
Attendance at the seminar is open to anyone.
Students may enroll in CSI 991, Section 001, for one hour credit.
The course is graded as "S" or "U".
In order to receive a grade of "S", students who are enrolled in the class
will be required to make two presentations during the semester.
One of the required presentations must be on a recent article in computational learning or computational statistics. ("Recent" means 2005 or later.) The other presentation can be on another article, or it can be on a computational tool or facility; in that case, some actual computing or communication should be demonstrated.
A brief written report is also required. This can be in the form of straight text (5 or 6 pages) or a copy of the presentation slides.
The presentations on research articles can be simple summaries, or, preferably, critical reviews citing other work or possible approaches. Monte Carlo studies or applications on sample datasets would be nice.
Publication of the results of research is generally motivated by personal ambition rather than by a desire to advance science. Personal ambition may be nothing more than the satisfaction of becoming better known within the scientific community, or it may involve advancement in one's field of employment. These factors contribute to the plethora of publications, which in turn results in a wide range in the quality of publications. It is an unfortunate fact that almost anything can be published somewhere.
Different fields of science place differing values on various types of publications. In mathematics and the legacy sciences, "archival" journals have always been at the top. In computer science and various related areas, conference proceedings have risen to the top. Within the class of journals, "impact factors" or various other measures serve to order the journals within a given field. Conferences are often ranked by "acceptance rate", in which the numerator is the number of papers presented and the denominator is some number determined by the conference organizers, ostensibly related to the number of papers that were submitted for possible presentation.
The form of the bibliographic entry usually depends on the type of publication being referenced, whether it is a book or other type of stand-alone publication, an article in a journal, an article in an edited book, an article in a conference proceedings, a webpage, or some other type of publication.
Webpages and some other types of publication require an additional bit of information: the time when it was accessed (usually just the date of access). This is necessary because the authors of webpages can change the contents without changing the URL.
In a book, the author usually can choose the form to use. In conference proceedings and journals, the organizers or the editors may allow the individual articles to use different forms of bibliographic entries or they may require all articles to use some specified form. Some people feel that use of abbreviations in authors' first names and in the name of the journal raises the level of scholarship.
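As a concrete illustration, these distinctions among publication types show up directly in BibTeX, where each type of publication has its own entry kind and set of fields. The entries below are invented placeholders, not real references; note the extra access-date field for the webpage:

```bibtex
@book{doe2010,
  author    = {Doe, Jane},
  title     = {An Invented Book Title},
  publisher = {Some Publisher},
  year      = {2010}
}

@article{doe2011,
  author  = {Doe, Jane and Roe, Richard},
  title   = {An invented journal article},
  journal = {Some Journal},
  volume  = {12},
  pages   = {1--20},
  year    = {2011}
}

@inproceedings{roe2012,
  author    = {Roe, Richard},
  title     = {An invented conference paper},
  booktitle = {Proceedings of Some Conference},
  pages     = {100--110},
  year      = {2012}
}

@misc{doe2013web,
  author       = {Doe, Jane},
  title        = {An invented webpage},
  howpublished = {\url{https://example.com}},
  note         = {Accessed 2013-09-01}
}
```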
The character string is divided into two parts, separated by a forward slash. The first part of the string indicates the organization that registered the object, and the second part is the unique string assigned to the object. For example, in the DOI 10.1080/0952813X.2010.505800, the first part indicates the Taylor & Francis Group, which is a large publishing house that publishes a number of journals (including the Journal of the American Statistical Association) and books. The second part indicates a specific article. The format of the second part is decided by the registering organization.
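Splitting a DOI into its two parts is just a matter of breaking the string at the first forward slash. A minimal sketch (the function name `split_doi` is invented for illustration):

```python
def split_doi(doi):
    """Split a DOI into its registrant prefix and object suffix.

    The prefix (before the first slash) identifies the registering
    organization; the suffix is the unique string that organization
    assigned to the object.
    """
    prefix, suffix = doi.split("/", 1)  # split at the FIRST slash only
    return prefix, suffix

# The example DOI from the text; prefix 10.1080 is the Taylor & Francis Group.
prefix, suffix = split_doi("10.1080/0952813X.2010.505800")
print(prefix)   # 10.1080
print(suffix)   # 0952813X.2010.505800
```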
A registry, maintained by a consortium organization called the International DOI
Foundation, records where the object is stored on the internet.
To find the document in the example DOI above, use
https://dx.doi.org/10.1080/0952813X.2010.505800
This takes you to a site that is devoted to that specific document. That site
is "free", but to
access the document, some privilege, such as a subscription or direct
payment, may be required. In some cases, your GMU credentials may take you to
the document.
In other cases, even if GMU has full access rights, you can't get there from
here; you have to go to a GMU site and proceed to the location of the document.
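The resolver URL for any DOI is formed the same way as in the example above: prepend the resolver's address to the DOI string. A one-line sketch (the function name `doi_url` is invented):

```python
def doi_url(doi):
    """Return the resolver URL for a DOI, as in the example in the text."""
    return "https://dx.doi.org/" + doi

print(doi_url("10.1080/0952813X.2010.505800"))
# https://dx.doi.org/10.1080/0952813X.2010.505800
```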
Here are some examples of the form that I prefer for various types of entries, drawn from the oldest sources I could find.
Note the general arrangement of the various fields, the various punctuation marks used, the use of exact names as in the published source, and so on. This is the style I use. You can use whatever you want so long as it conforms to the rules of whoever publishes it and so long as it contains all of the relevant fields.
Another type of citation that is very popular in computer science and various related areas is to assign each bibliography entry a hash code of numerals and letters based on the author(s) name(s), the date, and/or other parts of the bibliographic entry.
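One plausible scheme for such keys, loosely modeled on the "alpha" labels BibTeX can produce, is sketched below; the exact rules here are an illustration, not a standard, and the function name `citation_key` is invented:

```python
def citation_key(last_names, year):
    """Build a short alphanumeric citation key from author last names
    and a publication year, roughly in the style of "alpha" labels.

    A single author contributes the first three letters of the name;
    multiple authors each contribute their initial letter.  A
    two-digit year is appended.
    """
    if len(last_names) == 1:
        letters = last_names[0][:3]
    else:
        letters = "".join(name[0] for name in last_names)
    return letters + str(year % 100).zfill(2)

# Using author names that appear in the schedule below:
print(citation_key(["Chaovalit", "Gangopadhyay", "Karabatis", "Chen"], 2011))
# CGKC11
print(citation_key(["Mone"], 2013))
# Mon13
```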
Another simple type of citation is the author(s) last name(s) and the date of publication, possibly with an appended "a", "b", etc., enclosed in parentheses. This is the method I prefer and use, because when reading the article, if I'm somewhat familiar with the literature, the citation tells me what the reference is without my having to go to the bibliography.
There is a nice program for producing bibliographies in TeX called BibTeX. There are other nice packages that tie in with BibTeX for making citations. I use these packages most of the time.
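A minimal sketch of the workflow (filenames invented): \cite commands in the document pull entries from a .bib file, and BibTeX formats the reference list. The plain style shown gives numeric citations; packages such as natbib provide the author-year form described above.

```latex
% main.tex -- typical run: latex main, bibtex main, latex main, latex main
\documentclass{article}
\begin{document}
Deep learning is surveyed by \cite{lecun2015}.
\bibliographystyle{plain}   % choose a style matching your preferred format
\bibliography{refs}         % entries live in refs.bib
\end{document}
```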
For your first presentation: if it is on a recent research article, email the bibliographic information (author, year, title, journal/proceedings name, page numbers) to the instructor, using an appropriate bibliographic format.
If it is to be on a computational tool or facility, briefly describe what you propose to discuss and indicate the format of your presentation (demo, code walk-through, etc.).
4:05pm
Presentation by William Ampeh.
Chaovalit, Pimwadee; Aryya Gangopadhyay; George Karabatis;
and Zhiyuan Chen (2011),
Discrete wavelet transform-based time series analysis and mining,
ACM Computing Surveys, article no. 6.
4:45pm
Presentation by John Leung.
Bobadilla, J.; F. Ortega; A. Hernando; and A. Gutiérrez (2013),
Recommender systems survey,
Knowledge-Based Systems, 46, 109--132.
4:10pm
Presentation by William Basinger.
Erdin, Rebekka; Christoph Frei; and Hans Kuensch (2012),
Data transformation and uncertainty in geostatistical combination of
radar and rain gauges,
Journal of Hydrometeorology, 1332--1346.
4:40pm
Presentation by Redouane Betrouni.
Wu, Wenyan; Robert J. May; Holger R. Maier; and Graeme C. Dandy (2013),
On a benchmarking approach for comparing data splitting methods for modeling
water resources parameters using artificial neural networks,
Water Resources Research, 49, 7598--7614.
The presentations this week, and some in subsequent weeks, are on deep learning,
which, aside from just its catchy name, does offer some additional analytical
power. The common types of deep learning methods are based on neural nets (or
"artificial" neural nets, ANNs). Their main characteristic is the use of
multiple hidden layers, along with modifications, such as "convolutions", to prevent overfitting.
William Ampeh has provided the following links for information about R packages
that implement neural nets. (Thanks!)
4:10pm
Presentation by Kevin Ham.
Roy, D.; K.S.R. Murty; and C.K. Mohan (2015),
Feature selection using deep neural networks,
2015 International Joint Conference on Neural Networks (IJCNN),
1--6.
4:40pm
Presentation by Yijun Wei.
LeCun, Yann; Yoshua Bengio; and Geoffrey Hinton (2015),
Deep learning,
Nature, 521, 436--444.
4:10pm
Presentation by Ibrahim Elhag.
Mone, G. (2013), Beyond Hadoop, Communications of the ACM ???
This finishes the first round of presentations.
4:40pm
Presentation by William Ampeh.
Tools for scientific computation.
Slides.
4:10pm
Presentation by John Leung.
Deep Learning with CUDA on GPU through the Python Theano Library.
4:40pm
Presentation by William Basinger.
von Tscharner, M.; S.M. Schmalholz; and J.-L. Epard
(2016),
3-D Numerical Models of Viscous Flow Applied to Fold Nappes and the Rawil
Depression in the Helvetic Nappe System (Western Switzerland),
Journal of Structural Geology, 32--46.
4:10pm
Presentation by Redouane Betrouni.
4:25pm
Presentation by Kevin Ham.
Jia, Yangqing; Evan Shelhamer; Jeff Donahue; Sergey Karayev;
Jonathan Long; Ross Girshick; Sergio Guadarrama; and Trevor Darrell
(2014),
Caffe: Convolutional Architecture for Fast Feature Embedding,
Proceedings of the 22nd ACM International Conference on Multimedia,
675--678,
doi: 10.1145/2647868.2654889
Abstract.
4:50pm
Presentation by Yijun Wei.
Mnih, Volodymyr; Koray Kavukcuoglu; David Silver; Andrei A. Rusu; Joel Veness;
Marc G. Bellemare; Alex Graves; Martin Riedmiller; Andreas K. Fidjeland;
Georg Ostrovski; Stig Petersen; Charles Beattie; Amir Sadik; Ioannis Antonoglou;
Helen King; Dharshan Kumaran; Daan Wierstra; Shane Legg; and Demis Hassabis (2015),
Human-level control through deep reinforcement learning,
Nature, 518, 529--533.