SCIgen

Written in Perl
Available in English
Type Paper generator
License GNU General Public License
Website http://pdos.csail.mit.edu/scigen/

SCIgen is a paper generator that uses a context-free grammar to randomly generate nonsense in the form of computer science research papers. Its original data source was a collection of computer science papers downloaded from CiteSeer. All elements of the papers are generated, including graphs, diagrams, and citations. Created by scientists at the Massachusetts Institute of Technology, its stated aim is "to maximize amusement, rather than coherence." [1] Originally created in 2005 to expose the lack of scrutiny of submissions to conferences, the generator was later used, primarily by Chinese academics, to create large numbers of fraudulent conference submissions, leading to the retraction of more than 120 SCIgen-generated papers and the creation of detection software to combat its use. [2]
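To illustrate the general idea of random generation from a context-free grammar, the following is a minimal sketch in Python. It is not SCIgen's actual code or grammar (SCIgen itself is written in Perl): the rule names, the vocabulary, and the helper functions `expand` and `generate` are all invented here for illustration, and the depth cap is an added safeguard rather than anything described in the source.

```python
import random

# Hypothetical toy grammar, NOT SCIgen's real rules.
# Keys are nonterminals; each maps to a list of alternative productions.
# Any token that is not a key is treated as literal text.
GRAMMAR = {
    "SENTENCE": [
        ["Many", "PEOPLE", "would agree that", "CLAIM", "."],
        ["In order to", "GOAL", ",", "we confirm that", "CLAIM", "."],
    ],
    "PEOPLE": [["physicists"], ["hackers worldwide"], ["researchers"]],
    "GOAL": [["solve this riddle"], ["address this issue"]],
    "CLAIM": [
        ["NOUN", "can be made", "ADJ"],
        ["NOUN", "and", "NOUN", "are largely incompatible"],
    ],
    "NOUN": [["SMPs"], ["web browsers"], ["access points"]],
    # The last alternative is recursive, so lists of adjectives can grow.
    "ADJ": [["stochastic"], ["cacheable"], ["ADJ", "and", "ADJ"]],
}

MAX_DEPTH = 10  # assumed safeguard against very deep recursion


def expand(symbol, rng, depth=0):
    """Recursively expand a symbol into a list of literal tokens."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal: emit as-is
    options = GRAMMAR[symbol]
    # Past the depth cap, always take the first (non-recursive) alternative.
    production = options[0] if depth >= MAX_DEPTH else rng.choice(options)
    tokens = []
    for part in production:
        tokens.extend(expand(part, rng, depth + 1))
    return tokens


def generate(seed=None):
    """Generate one random sentence; a fixed seed makes it reproducible."""
    rng = random.Random(seed)
    text = " ".join(expand("SENTENCE", rng))
    # Tidy the spacing that joining tokens puts before punctuation.
    return text.replace(" ,", ",").replace(" .", ".")


if __name__ == "__main__":
    for seed in range(3):
        print(generate(seed))
```

Each call picks one alternative at random for every nonterminal it meets, which is why individual sentences can look grammatical while the text as a whole carries no meaning.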


Sample output

Opening abstract of Rooter: A Methodology for the Typical Unification of Access Points and Redundancy: [3]

Many physicists would agree that, had it not been for congestion control, the evaluation of web browsers might never have occurred. In fact, few hackers worldwide would disagree with the essential unification of voice-over-IP and public/private key pair. In order to solve this riddle, we confirm that SMPs can be made stochastic, cacheable, and interposable.

Prominent results

In 2005, a paper generated by SCIgen, Rooter: A Methodology for the Typical Unification of Access Points and Redundancy, was accepted as a non-reviewed paper to the 2005 World Multiconference on Systemics, Cybernetics and Informatics (WMSCI) and the authors were invited to speak. The authors of SCIgen described their hoax on their website, and it soon received great publicity when picked up by Slashdot. WMSCI withdrew their invitation, but the SCIgen team went anyway, renting space in the hotel separately from the conference and delivering a series of randomly generated talks on their own "track". The organizer of these WMSCI conferences is Professor Nagib Callaos. From 2000 until 2005, the WMSCI was also sponsored by the Institute of Electrical and Electronics Engineers. [4] The IEEE stopped granting sponsorship to Callaos from 2006 to 2008.

Submitting the paper was a deliberate attempt to embarrass WMSCI, which the authors claim accepts low-quality papers and sends unsolicited requests for submissions in bulk to academics. As the SCIgen website states:

One useful purpose for such a program is to auto-generate submissions to conferences that you suspect might have very low submission standards. A prime example, which you may recognize from spam in your inbox, is SCI/IIIS and its dozens of co-located conferences (check out the very broad conference description on the WMSCI 2005 website).

About SCIgen [5]

Computing writer Stan Kelly-Bootle noted in ACM Queue that many sentences in the "Rooter" paper were individually plausible, which he regarded as posing a problem for automated detection of hoax articles. He suggested that even human readers might be taken in by the effective use of jargon ("The pun on root/router is par for MIT-graduate humor, and at least one occurrence of methodology is mandatory") and attribute the paper's apparent incoherence to their own limited knowledge. His conclusion was that "a reliable gibberish filter requires a careful holistic review by several peer domain experts". [6]

Schlangemann

The pseudonym "Herbert Schlangemann" was used to publish fake scientific articles at international conferences that claimed to practice peer review. The name is taken from the Swedish short film Der Schlangemann.

In all cases, the published papers were withdrawn from the conferences' proceedings, and the conference organizing committee as well as the names of the keynote speakers were removed from their websites.

List of works with notable acceptance

In conferences

  • Rob Thomas: Rooter: A Methodology for the Typical Unification of Access Points and Redundancy, 2005 for WMSCI (see above)
  • Mathias Uslar's paper was accepted to the IPSI-BG conference. [12]
  • Professor Genco Gulan published a paper in the 3rd International Symposium of Interactive Media Design. [13]
  • A 2013 scientometrics paper demonstrated that at least 85 SCIgen papers have been published by IEEE and Springer. [14] Over 120 SCIgen papers were removed according to this research. [15]

In journals

  • Students at Iran's Sharif University of Technology published a paper in Elsevier's Journal of Applied Mathematics and Computation. [16] The students wrote under the surname "MosallahNejad", which is not a traditional Persian name and translates literally from Persian as "from an Armed Breed". The paper was subsequently removed when the publishers were informed that it was a joke paper. [17]
  • Mikhail Gelfand published a translation of the "Rooter" article in the Russian-language Journal of Scientific Publications of Aspirants and Doctorants in August 2008. Gelfand was protesting against the journal, which was apparently not peer reviewed and charged Russian PhD candidates 4,000 rubles to publish in an "accredited" scientific journal. The accreditation was revoked two weeks later. [18] [19] [20] [21] (See Dissernet for related information.)
  • Springer Science+Business Media and IEEE were also the subject of similar pranks.

Spoofing Google Scholar and h-index calculators

Refereeing performed on behalf of the Institute of Electrical and Electronics Engineers has also been subject to criticism after fake papers were discovered in conference publications, most notably by Labbé and a researcher using the pseudonym of Schlangemann. [22] [23] [24] [25] [26] [27]

In a 2010 paper, Cyril Labbé from Grenoble University demonstrated the vulnerability of h-index calculations based on Google Scholar output by feeding it a large set of SCIgen-generated documents that cited each other, effectively an academic link farm. Using this method, the author managed to rank "Ike Antkare" ahead of Albert Einstein, for instance. [28]
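The effect of such a link farm on the h-index can be sketched with a few lines of Python. The h-index of an author is the largest number h such that h of their papers each have at least h citations. The function names `h_index` and `link_farm_counts` are made up for this illustration, and the assumption that every generated document cites every other one is a simplification, not a description of the actual experiment.

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def link_farm_counts(n_docs):
    """Citation counts for a hypothetical farm of n_docs documents.

    Simplifying assumption: every document cites every other document,
    so each one receives n_docs - 1 citations.
    """
    return [n_docs - 1] * n_docs


if __name__ == "__main__":
    # An ordinary publication record: four papers have at least 4 citations.
    print(h_index([10, 8, 5, 4, 3]))       # 4
    # One hundred generated documents citing each other.
    print(h_index(link_farm_counts(100)))  # 99
```

Because the index counts citations without judging their source, a closed set of mutually citing documents raises it almost linearly with the number of documents.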

2013 retractions

In 2013, more than 120 published conference papers created by SCIgen were retracted by Springer and the IEEE. Unlike previous submissions, which were intended as pranks, these submissions were largely made by Chinese academics who were using SCIgen papers to boost their publication records. [29]

SciDetect

In 2015, SciDetect was released by Springer. This software, developed by Cyril Labbé, is designed to automatically detect papers generated by SCIgen. [2]

2021 report

In 2021, a study was published on 243 SCIgen papers that had appeared in the academic literature. The authors found that SCIgen papers made up 75 per million papers (< 0.01%) in information science, and that only a small fraction of the detected papers had been dealt with. [30] [31]


References

  1. SCIgen - An Automatic CS Paper Generator
  2. Bohannon, John (2015-03-27). "Hoax-detecting software spots fake papers". Science | AAAS. Retrieved 2020-09-28. Rather than being created as pranks, it seems that many of the fake papers were coming from China where they were "bought by academics and students" to pad their publication records, says the lead researcher behind the investigation, Cyril Labbé, a computer scientist at Joseph Fourier University in Grenoble, France.
  3. Stribling, Jeremy; Aguayo, Daniel; Krohn, Maxwell. "Rooter: A Methodology for the Typical Unification of Access Points and Redundancy" (PDF).
  4. Heinrich Zankl: Der Science-Generator - ein geniales Publikationsprogramm. In W. Hömberg, E. Roloff (eds.): Jahrbuch der Marginalistik IV. Lit-Verlag, Münster, 2016, pp. 60–67. ISBN 978-3-643-99793-7
  5. "SCIgen - An Automatic CS Paper Generator". MIT.
  6. Stan Kelly-Bootle (July–August 2005). "Call that gibberish?". ACM Queue. 3 (6): 64. doi:10.1145/1080862.1080884.
  7. "CSSE Conference Program" (PDF).
  8. "The official Herbert Schlangemann Blog, The whole story behind the paper "Towards the Simulation of E-Commerce"".
  9. kdawson (December 24, 2008). "Software-Generated Paper Accepted At IEEE Conference". Slashdot. VA Linux Systems Japan. Retrieved May 5, 2009.
  10. Peter-Michael Ziegler (December 26, 2008). "Dr. Herbert Schlangemann - oder die Geschichte eines pseudowissenschaftlichen Nonsens-Papiers (in German)". Heise Online. Heise Zeitschriften Verlag. Retrieved May 5, 2009.
  11. Heise Online webpage (in German)
  12. "Mathias Uslar's paper". Archived from the original on 2009-06-15.
  13. "About Genco Gulan's paper".
  14. "Duplicate and Fake Publications in the Scientific Literature : How many SCIgen papers in Computer Science?" (PDF). Hal.archives-ouvertes.fr. Retrieved 2014-05-15.
  15. "Publishers withdraw more than 120 gibberish papers". Nature. 24 February 2014. Retrieved 25 February 2014.
  16. Rohollah Mosallahnezhad. "Cooperative, Compact Algorithms for Randomized Algorithms" (PDF). Archived from the original (PDF) on 2009-12-29.
  17. Rohollah Mosallahnezhad (2007), "REMOVED: Cooperative, compact algorithms for randomized algorithms", Applied Mathematics and Computation, doi:10.1016/j.amc.2007.03.011
  18. "Mon ordinateur écrit mieux que le tien!". Agence Science-Presse (in French). Canada. 8 September 2009. Retrieved 4 October 2011.
  19. "Rooter invades Russia". SCIgen. 8 January 2009. Archived from the original on 2014-04-03. Retrieved 4 October 2011.
  20. Malozemov, Sergei (7 October 2008). Группа отечественных ученых поставила эксперимент — смешала сложные термины случайным образом, а полученный текст отослала в один из научных журналов. NTV (in Russian). Retrieved 4 October 2011.
  21. "Feedback". New Scientist. 15 August 2009.
  22. Labbé, Cyril; Labbé, Dominique (2013). "Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science?". Scientometrics. 94 (1): 379–396. doi:10.1007/s11192-012-0781-y. S2CID 6889400.
  23. Oransky, Ivan (February 24, 2014). "Springer, IEEE withdrawing more than 120 nonsense papers". retractionwatch.com. WordPress.com. Retrieved April 29, 2014.
  24. de Gloucester, Paul Colin (2013). "Referees Often Miss Obvious Errors in Computer and Electronic Publications". Accountability in Research: Policies and Quality Assurance. 20 (3): 143–166. Bibcode:2013ARPQ...20..143D. doi:10.1080/08989621.2013.788379. PMID 23672521. S2CID 42975675.
  25. Dawson, K. (December 23, 2008). "Software-Generated Paper Accepted At IEEE Conference". slashdot.org. Dice. Retrieved April 29, 2014.
  26. Hatta, Masayuki (December 24, 2008). "IEEEカンファレンス、自動生成のニセ論文をアクセプト". slashdot.jp (in Japanese). OSDN Corporation. Retrieved April 29, 2014.
  27. Ziegler, Peter-Michael (December 26, 2008). "Dr. Herbert Schlangemann - oder die Geschichte eines pseudowissenschaftlichen Nonsens-Papiers". heise.de (in German). Heise Zeitschriften Verlag. Retrieved April 29, 2014.
  28. "Les rapports de recherche du LIG" (PDF). Rr.liglab.fr. Retrieved 2014-05-15.
  29. Van Noorden, Richard (2014). "Publishers withdraw more than 120 gibberish papers". Nature News. doi:10.1038/nature.2014.14763.
  30. Cabanac, Guillaume; Labbé, Cyril (2021-05-25). "Prevalence of nonsensical algorithmically generated papers in the scientific literature". Journal of the Association for Information Science and Technology. 72 (12): 1461–1476. doi:10.1002/asi.24495. ISSN 2330-1635. S2CID 236374033.
  31. Noorden, Richard Van (2021-05-27). "Hundreds of gibberish papers still lurk in the scientific literature". Nature. 594 (7862): 160–161. Bibcode:2021Natur.594..160V. doi:10.1038/d41586-021-01436-7. PMID 34045760. S2CID 235232305.
