Thursday, August 30, 2012

German politician's son loses doctorate, too.

The Bavarian daily newspaper Münchener Abendblatt reports that the son of the former Bavarian premier Edmund Stoiber has now also been stripped of his doctorate, in his case by the Austrian University of Innsbruck. Previously, one of Stoiber's daughters -- whose nickname "Vroni" was the basis for the name of the plagiarism documentation platform VroniPlag Wiki -- had her doctorate rescinded on the basis of plagiarism by the German University of Constance.

The detection of the plagiarism in the doctorate of the son was done by Stefan Weber, a private plagiarism investigator, on behalf of the Austrian newspaper Tiroler Tageblatt, the Münchener Abendblatt reports. The newspaper notes that Stoiber's son is contesting the rescinding of his doctorate in the courts. His sister also went to court, but the court confirmed the position of the university.

Saturday, August 18, 2012

Guttenbergs Ghostwriter

I've just finished a novel that has been lying around for months: Ich war Guttenbergs Ghost - Eine Satire (I Was Guttenberg's Ghost - A Satire), by Norbert Hoppe (a pseudonym). The idea that zu Guttenberg, the German defense minister whose plagiarized dissertation set off a long-running discussion of plagiarism, had hired a ghostwriter was hotly disputed last year.

The GuttenPlag Wiki has a number of pages that discuss all angles of this notion: The Ghostwriter Forum - The Stylistic Forum - The Stylistic Analysis. The variation used in this book is that the author, Hoppe, who had known zu Guttenberg since third grade, had been helping shape the media figure zu Guttenberg for a number of years. When zu Guttenberg finds that he doesn't have the time to finish the thesis, he dumps the box of diskettes on Hoppe and asks him to finish it up for him.

The story is a bit long at times, and Hoppe professes a puppy love for zu Guttenberg's wife as one of the reasons he went along with everything through all the years. But the story does present a plausible explanation for the extensive patchwork quilt plagiarism in zu Guttenberg's thesis. It is also interesting to see how things change from when zu Guttenberg is just a member of parliament to when he becomes a minister. Everything is now focused on his media presence, and he gets new "handlers" who organize his day.

I begin to understand why politicians don't actually seem to get things done. They are so absorbed in presenting themselves and making sure not to make public gaffes that there is little time left to think about the actual politics. Hoppe gives a shockingly plausible reason for zu Guttenberg getting rid of conscription. He meets a recruit at a bus stop; the soldier is sitting sloppily with his feet on the seat. zu Guttenberg is shocked, tells him to sit up straight, and gets a snippy answer. "Do you know who I am?", zu Guttenberg demands. "Nö" ("Nope"), the soldier answers. When zu Guttenberg tells him that he is actually his top boss, the soldier doesn't care. His tour of duty ends in a month. So on the spur of the moment, zu Guttenberg decides to have only a volunteer army of people who want to be there.

The book is not great literature, and probably does not make much sense for people who do not know the zu Guttenberg case. But if you do and you read German, it is an amusing read.

Thursday, August 16, 2012

VroniPlag Wiki Case # 29

The cases of plagiarism in dissertations seem to be a never-ending story. The thesis in VroniPlag Wiki's case # 29 was awarded the best grade and a prize at the TU Dresden in 2009. This thesis, submitted to the business faculty, deals with statistics and risk management. Currently, 32 % of the pages contain text or formulas that closely parallel other works.

Are the universities actually doing anything to fix this obviously broken system? Yes, they are having the submitters swear on oath that they did everything correctly. And purchasing software (that can only suggest text parallels, never determine the absence or presence of plagiarism).

The magazine "ZEIT campus" is currently publishing the results from a three-year investigation by the University of Bielefeld into how many students cheat. Their results (partially published at Zeit Online) are shocking: 79 % of students asked self-reported having cheated. It's only going to get worse if nothing happens.

Wednesday, August 15, 2012

The Case of Fareed Zakaria

The news from the States is full of stories about India-born Fareed Zakaria, an American journalist with Time and CNN who has been suspended in a plagiarism scandal. He has admitted to plagiarizing a number of paragraphs from a New Yorker article on gun control. His article, "The Case for Gun Control", is still online, as is of course the article by Harvard University history professor Jill Lepore, "Battleground America". CNN appears to have suspended him for similar plagiarism in a blog that has been taken offline.

I feel that taking stuff offline is a problem. I want to be able to examine the evidence for myself. My news feeder (Shameless plug for Highbeam) from the States led me to the Seattle Times' article "CNN's Zakaria sorry for plagiarism". There an example of the "plagiarism" is given, and I looked up, puzzled. This was a statement of fact. The only text similarity was the name of the person quoted and the book quoted from.

I dug around for the articles and threw them at the comparison tool, SIM_TEXT, that we use on the VroniPlag Wiki.

[Figure: side-by-side SIM_TEXT comparison of the Fareed Zakaria and Jill Lepore articles]

Hmm. Wow. We have cases of plagiarism in Germany with massively more copied-and-pasted text that were found by their universities (BTU Cottbus and University of Heidelberg) to have only "technical weaknesses": Dd with 44 % of the pages affected: 2835 - 48 - 103; Nk with 75 % of the pages affected (many from her doctoral advisor): 17 - 66 - 81 - 90.

I am, of course, of the opinion that both universities listed above are in error: these theses are grave plagiarisms, and the doctoral titles should be rescinded. Fareed's sin is the selection of facts and quotes that were lifted from the New Yorker. It is right to suspend him and give him some time to think about what he did here. I don't know about other texts that might contain similar material. But this is far, far less than the Jayson Blair case. But should he be fired? What do my readers think?

Update 2012-08-19: The Chicago Tribune reported on Aug. 16 that Zakaria has been reinstated both at Time and at CNN. I'm quite glad, as it would have been very difficult to explain why large-scale copy and paste is okay in Germany while a paraphrasing error in the US costs someone their job.

Saturday, August 11, 2012

Inquiry results published

The Ludwigshafener hospital has published a press release about the results of the inquiry into the research of Joachim Boldt. He was the former "retraction king", having had to retract 88 of his publications. He has since been "de-throned" by Yoshitaka Fujii (as reported by Retraction Watch).

Boldt already left the hospital in November of 2010, after criticism of his research grew too loud to ignore. The board examining his papers needed a good 18 months to go through everything and determine that "in a large number of the studies investigated, the conduct of research failed to meet required standards. False data were published in at least 10 of the 91 articles examined, including, for instance, data on patient numbers/ study groups as well as data on the timing of measurements."

They try to play it down as being mostly a procedural thing, and are relieved that no patients came to harm. They promise that they have fixed their procedures.

But I still have a few questions:
  • Where did the money for this research come from? Was this government money? Was it from a pharmaceutical company?
  • Has anyone used the since-withdrawn studies? That is, did anyone else quote his papers or try and replicate the experiments?
  • Is Boldt still permitted to practice medicine?
  • The hospital states that they will be monitoring future clinical studies - how will they be encouraging people to speak up about falsification of data? That has nothing to do with monitoring, which brings up notions of even MORE paperwork. How are they going to foster an environment in which people can question the research being done without fear of retaliation?
  • Why did this take 18 months?

Wednesday, August 8, 2012

Plagiarism Vocabulary

I was digging around looking for papers on how exactly plagiarism detection software works, when I was directed to "Classifications of Plagiarism Detection Engines", published in 2005 by Fintan Culwin and Thomas Lancaster in the online journal ITALICS 4(2). As a software engineer I was quite enjoying digging out the ancient papers quoted there about detecting plagiarism in programming exercises in FORTRAN. Ahh, those were the days, my first programming language.... and what a great use of Halstead's Software Science!

Then I realized that Thomas Lancaster had submitted his dissertation "Effective and Efficient Plagiarism Detection" in 2003 to the London South Bank University, London, UK. He has an excellent, detailed classification of the plagiarism detection systems available at that time, and a good overview of a lot of the technical papers that are to be found on the topic. The glossary alone is a joy to read, and I have asked for and received permission to repeat portions here. There are also a number of papers that Lancaster has published or prepared on the topic included in the appendix. Lancaster focuses in the thesis on a four-step process for determining plagiarism:
  • Collection stage - The first stage of the four-stage plagiarism detection process. This is where students submit their work to an electronic system so it can later be analysed for similarity.
  • Analysis stage - The second stage of the four-stage plagiarism detection process. Here all submissions are compared with each other (for intra-corpal plagiarism detection) or the external sources such as the Web (for extra-corpal plagiarism detection) to find submissions that are similar to each other or the Web sources.
  • Confirmation stage - The third stage of the four-stage plagiarism detection process. Here a tutor checks the pairs of student submissions that have been judged to be similar to see if they represent plagiarism or they represent legitimate shared citations or false hits. The tutor decides which pairs will go on to be investigated further.
  • Investigation stage - The fourth and final stage of the four-stage plagiarism detection process. This is where pairs of similar submissions have been found and they have been confirmed by human inspection to be similar and possible cases of plagiarism. In this case further evidence is collected, such as student interviews and marked up copies of the submissions and penalties are given.
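
The analysis and confirmation stages can be illustrated with a minimal Python sketch. This is only an illustration and is not taken from Lancaster's thesis: the function names, the threshold value, and the use of the standard library's difflib.SequenceMatcher as the similarity measure are all assumptions, and the "tutor" is just a callback that stands in for human judgment.

```python
from difflib import SequenceMatcher
from itertools import combinations


def analyse(submissions, threshold=0.6):
    """Analysis stage (intra-corpal): compare every pair of submissions.

    `submissions` maps a student name to the submitted text (the result of
    the collection stage). Returns (score, name_a, name_b) tuples for all
    pairs scoring at least `threshold`, most similar first. The threshold
    is an arbitrary illustrative value, not a recommended setting.
    """
    ranked = []
    for (name_a, text_a), (name_b, text_b) in combinations(sorted(submissions.items()), 2):
        score = SequenceMatcher(None, text_a, text_b).ratio()
        if score >= threshold:
            ranked.append((score, name_a, name_b))
    ranked.sort(reverse=True)
    return ranked


def confirm(ranked_pairs, tutor_decides):
    """Confirmation stage: a human decides which similar pairs go forward.

    `tutor_decides(name_a, name_b, score)` returns True if the pair should
    be investigated further. The software only suggests; it never decides.
    """
    return [(a, b) for score, a, b in ranked_pairs if tutor_decides(a, b, score)]
```

The pairs returned by confirm() would then go on to the investigation stage, which involves interviews and marked-up copies and cannot sensibly be automated.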
My selection from the glossary (my favorite definition is in blue):
  • Academic plagiarism - Plagiarism carried out by academics, for instance copying journal articles and submitting them as their own work for possible career development.
  • Attribute counting metrics - A count of some property of a single document which might involve tokenisation. This has been redefined to remove the inconsistencies from the literature but is not considered a sensible classification.
  • Authorship attribution - The branch of linguistics that aims to calculate the author of a work based on knowledge of works by other known authors. This is not appropriate for plagiarism detection since there is no corpus of known work by a given student.
  • Characters Metric - A simple metric that measures the number of sequences of characters of a chosen length two documents have in common.
  • Cheating - Unauthorised behaviour that is going against student etiquette when trying for an academic award or to gain an advantage over other students. Examples include plagiarism, use of cribs in exams and paying someone to complete an assignment specification on your behalf. 
  • Closeness Calculation - A computational part of automated plagiarism detection where a single number is generated from a number of different metrics to decide how similar two submissions are.
  • Contractive plagiarism - Plagiarism where the source is larger than the copy and hence the source has been reduced in some way to create the student submission.
  • Corpal Metrics - A multi-dimensional metric that is a measure of a property of an entire corpus, for instance the proportion of submissions using a given keyword.
  • Collusion - Where two students discuss and work on an assignment specification together and complete elements of their final submissions together. This might be judged to be intra-cor[p]al plagiarism.
  • Direct copy - Two student submissions that are identical to one another with no attempt at disguise. One is a direct copy of the other.
  • Disguise - Where a student has attempted to change a source and hand it in as their own submission so that the use of the original source won't be noticed.
  • Expansive plagiarism - Plagiarism where the source has been extended, either by adding new thoughts or adding filler words and phrases to make a student submission.
  • Extra-corpal plagiarism - Plagiarism where the plagiarism source is outside the corpus of student submissions, for instance a Web site or material from a book.
  • False hits - Pairs of submissions that are ranked high enough for a tutor to investigate them but are judged to be dissimilar, thus being a waste of tutor time.
  • Free text plagiarism - Plagiarism that has been done in natural language, for instance, altering the words of another writer and presenting it as your own work.
  • Hybrid metric systems - A system that uses a combination of both attribute counts and structure metrics to find similar submissions. This has been defined to remove the inconsistencies from the literature but is not considered a sensible method of classification.
  • Intra-corpal plagiarism - Plagiarism entirely within a corpus, primarily meaning two students who have copied from one another.
  • Missed pairs - A pair of submissions that contains plagiarism but is not automatically ranked in the upper portion of an ordered list of similar pairs and hence not investigated further by a tutor. 
  • Mosaic plagiarism - Plagiarism where chunks from different sources are used and rearranged in a way that could be considered like a mosaic is created from combining and arranging different pictures.
  • Multiply sourced - A student submission or external source that has been used in multiple student submissions.
  • Ostrich plagiarism policy - Where an academic institution states that plagiarism does not exist in their institution and has no formal way of dealing with it.
  • Paraphrasing - Using the ideas of another but rewriting them in your own words without suitable and continual acknowledgement.
  • Plagiarism - Taking the words or ideas of another and presenting them as your own without suitable acknowledgement.
  • Proactive plagiarism policy - A policy of an academic institution where plagiarism is actively sought out on a regular basis, perhaps by using automated detection methods and cases are followed up when they are found.
  • Professional plagiarism - Plagiarism in a professional setting, for instance copying an internal report or company Web page from another source or using a service that writes standard CVs or job applications.
  • Reactive plagiarism policy - The academic policy where plagiarism is not actively sought out but is taken seriously and followed up when it is identified during the course of marking.
  • Similarity - Where two submissions have words or ideas in common they are said to be similar. When they have been looked at by a tutor they may also be judged to be plagiarised.
  • Singularly sourced - A plagiarism source that has been copied from once only.
  • Source code plagiarism - Plagiarism of source code submissions, where two students have handed in programs where one has been derived from the other in some way. Detecting this is a well understood area since the constrained language reduces the number of possibilities that must be checked.
  • Structural Metrics - A metric that measures a property of one or more submissions where knowledge of the structure of the documents is needed.
  • Synthetic corpus - A corpus of documents that have been generated using synthetic means by taking sequences of words or characters in a known and defined order.
  • Thesaurising - A technique for plagiarism where words in a source are replaced by synonyms or changed in such a way that the submission makes the same points but the intention is that the plagiarism will not be discovered.
  • Visual metrics - A metric which is based on some property of the similarity visualisation that would be generated for a given pair of student submissions.
  • Words Pair Metric - A simple metric that measures the number of sequences of word pairs in common between two documents. Identified as the most effective simple metric.
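
To make a few of these definitions concrete, here is a minimal Python sketch of the Characters Metric, the Words Pair Metric, and a Closeness Calculation that combines them into a single number. It is one plausible reading of the glossary entries, not Lancaster's implementation: the function names, the default sequence length of 5, the whitespace and lower-case normalization, and the Jaccard-style weighting in closeness() are all assumptions made for illustration.

```python
def char_sequences(text, n=5):
    """All distinct character sequences of length n in the normalized text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def word_pairs(text):
    """All distinct pairs of adjacent words in the text."""
    words = text.lower().split()
    return set(zip(words, words[1:]))


def characters_metric(doc_a, doc_b, n=5):
    """Number of length-n character sequences the two documents share."""
    return len(char_sequences(doc_a, n) & char_sequences(doc_b, n))


def word_pairs_metric(doc_a, doc_b):
    """Number of adjacent word pairs the two documents share."""
    return len(word_pairs(doc_a) & word_pairs(doc_b))


def _overlap(set_a, set_b):
    """Shared items divided by all items (Jaccard index); 0.0 if both are empty."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def closeness(doc_a, doc_b, n=5, weight=0.5):
    """Combine both metrics into one number between 0.0 and 1.0.

    The weighted average of two Jaccard overlaps is an assumption chosen
    for simplicity; other ways of combining the metrics are possible.
    """
    chars = _overlap(char_sequences(doc_a, n), char_sequences(doc_b, n))
    pairs = _overlap(word_pairs(doc_a), word_pairs(doc_b))
    return weight * chars + (1 - weight) * pairs
```

For example, word_pairs_metric("the cat sat on the mat", "the cat sat on the rug") counts the four shared pairs "the cat", "cat sat", "sat on" and "on the". A high closeness value only marks a pair as worth a tutor's attention; whether it is plagiarism remains a human decision.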
I find it immensely helpful to have terms that are generally understood when we are speaking about plagiarism. I would personally use "Synonomizing" instead of "Thesaurising" (which I can't pronounce). I also like the focus on the process of determining plagiarism and not the products - the software that is used in the process. Lancaster's focus in the thesis is on intra-corpal plagiarism and the visualization of similarity. It is well worth a read, if you are working in this area.

Friday, August 3, 2012

Scientific ghostwriting

The online journal Laborjournal has an interesting editorial on scientific ghostwriting (in German) that includes a number of legal cases in Germany and some interviews with ghostwriters. The author also contacted a company that says that they do not write doctoral dissertations. Writing as a medical student, they inquired as to the costs for a medical dissertation on a particular topic. They were given an offer of 5,900 € for such a thesis, much less than the 7,000-10,000 € normally charged for a diploma or master's thesis of about 100 pages.

They also include an interview with the authors of PlagScan, a so-called plagiarism detection system. They do concede that software only finds about 60-70 % of plagiarism, as we have demonstrated a number of times over the years. But the authors are confident that they can soon write a system that can even detect ghostwriting.

I'm not of that opinion. I know that I write differently when I am blogging than when I am writing a scientific paper. I have different styles. Trying to detect ghostwriting with software is just not possible. However, we as professors do have a chance to detect this if we have our students regularly submit drafts so that we can see progress. This, however, takes quite a lot of time, giving feedback on drafts. We have to meet with the students and ask hard questions like: where did you find this paper? We don't have that in our library....

And time is the problem. I don't know about you, but my day only has 24 hours. The more students I have, the less time I have to work with each one individually. If we are not reading the theses, why should the students write them? This creates a culture in which people can get away with submitting ghostwritten papers. Of course, what are they going to do when they have to write after their studies? They've never learned to do research, to structure, to write. So that will either ensure an expensive market for ghostwriters, or they will end up getting fired - if there is someone with the guts to call them out on their lack of skills.

The university system needs reforming, and it needs it yesterday. And not just in Germany, it would seem.