Tuesday, July 22, 2014

Belgrade Mayor plagiarizes doctorate

A new plagiarism scandal has erupted in Serbian politics. The scandal around the dissertation of the Minister of the Interior, Nebojša Stefanović, is still in full swing. Now the dissertation of the Mayor of Belgrade, Sinisa Mali, entitled “Creating Value Through the Process of Restructuring and Privatization – Theoretical Concepts and Experiences of Serbia” and submitted in 2013 to the University of Belgrade’s Faculty of Organizational Sciences, has been documented to be heavily plagiarized.

A professor of finance at the European Business School in Wiesbaden, Germany, documented the plagiarism in English on the Serbian site Peščanik in early July.


An interactive graphical representation of the thesis has been put together, with every page of Mali's thesis linked to the iThenticate report on the plagiarism found on that page. Even considering all the caveats about the use of plagiarism detection software, quite a number of sources, including the Wikipedia, have been identified.
If the protection of ideas is no longer important in our society, then we will gamble our future away.



Saturday, June 21, 2014

6IIPC - Conference

I previously reported on the pre-conference of the Sixth International Integrity and Plagiarism Conference in Newcastle upon Tyne. I will now discuss some of the talks that I was able to attend.
There were four keynotes at the conference:
  • Toni Sant from Wikimedia UK spoke about student online research, aka using the Wikipedia. I was astounded at how many educators in the room were not very familiar with various aspects of how the Wikipedia is researched and written. Toni suggested that teachers have their students write articles for the Wikipedia – I strongly objected to that in the discussion, as the subsequent deletion of articles that are not encyclopaedic will frustrate the students.
  • Tricia Bertram Gallant, the academic integrity officer at UCSD, gave a fantastic talk about integrity for the "Real World." She pointed out that people cheat, period. We have to quit pretending that we are only interested in academic integrity, that is, integrity that is only valid in school. Instead, we need to reframe our thinking and focus on building integrity for the real world and not just for school. Our students cheat and plagiarize because they are human. We need to help them obtain skills in acting in an ethical manner in any situation, not just academic ones.
  • Samantha Grant presented parts of her documentary, A Fragile Trust, about Jayson Blair, a New York Times journalist fired for plagiarism. She and Teddi Fishman from the International Center for Academic Integrity then discussed questions that arose from the film. Samantha is now producing a game for journalists called Decisions on Deadline that presents ethical dilemmas for students to solve. The Society of Professional Journalists even has a hotline that journalists can call when they need to speak to someone anonymously.
  • Dan Ariely gave us the honest truth about dishonesty via video conference: We lie. We don't steal if given the opportunity, but if we think we can get away with something, we lie through our teeth, according to the many studies he has conducted. He suggests that we as educators need to teach our students about temptation and how to deal with it.
In between the keynotes there were nine paper sessions of five papers or workshops each. Unfortunately, the program had some glitches, such as both papers about finding plagiarism in Arabic being scheduled in parallel with each other or three workshops in the area of embedding institutional policy and practice offered at the same time.
One talk was especially amusing: Rui Silva-Sousa from Portugal spoke about whistleblowers on plagiarism and the moral grey area. That is, he was speaking about GuttenPlag Wiki or VroniPlag Wiki, among others. He notes that there is currently a moral panic with respect to plagiarism. The general population perceives an increase of plagiarism among politicians on the basis of media coverage. This legitimizes the culture of control and people will now more than ever report wrongdoing, especially for egoistic reasons, on the part of people who are now in the public eye. He tried to explain the motivation of the researchers documenting plagiarism, and decided they are somewhere between weird mobbers and serious scientists. They must be acting on ethical egoism and through their making the cases public, can cause excessively harsh results in the life of the person who plagiarized. He felt that knowing the names of the whistleblowers would make it easier to judge the morality of their work. 
I noted in the discussion that he was completely ignoring the person whose work was plagiarized, and that a thesis was plagiarized regardless of who speaks up about this fact. During a discussion over lunch we cleared up some misconceptions, the usual ones such as VroniPlag Wiki not only documenting politicians, and such. He admitted to not having looked at the sites that closely. I do wish that people would observe carefully before coming up with wild theories.
Mike Reddy, who teaches Games Development at the University of South Wales, gave a session on putting the "play" into plagiarism. We were to develop a game concerning some aspect of academic integrity within the hour. Our group didn't do too badly: we came up with a game we called "Freeloader", similar to Spoof, for 5–6 players (the size of a typical student project group). Each person has three coins and behind their backs chooses how many coins to hide in their fist and put out into the middle, representing their contribution to the project. Each person starts out with three peanuts/candies/whatever. Each person guesses how many coins in total are now in the middle; no two guesses can be the same. All fists are opened and the coins counted – if you guess right, you get a candy from everyone else in the group. If you run out of candies, you get a dog's chance (one last round). If there are only two people left, the number of candies you have to surrender upon being wrong is increased by one each round, so that a winner is found quickly who will get the top grade (i.e. a stash of candies).
Phil Newton from the University of Swansea gave a workshop about paper mills and custom-writing companies. He showed live demonstrations of things that are available for sale. In a nutshell: If we are asking for it (such as research diaries or multiple revisions), there is someone out there willing to sell it, and the less time there is left to complete it, the more expensive it is. We got into groups and tried to come up with ideas that focus more on the learning and less on producing items that can be easily ghosted. The ideas ranged from only giving examinations, using peers to police, flipped classrooms, thinking positively, using progression portfolios, decreasing the price of doing the right thing, and increasing the fear factor: if we catch you, it will hurt. In all, we didn't come up with THE solution, but it was good to commiserate with others about the problem.

It was great to meet old friends and meet new people interested in plagiarism, although it was sad to have to miss so many sessions. The conference was co-sponsored by Turnitin and ICAI, so of course many of the talks dealt with Turnitin. It was rather shocking to see how many newish users were so sure that the so-called "similarity index" that Turnitin reports is the true value of "plagiarism" in a paper. Some schools even define Turnitin similarity index levels for determining the sanction to be meted out. However, people with more experience using the system often temper their words. They understand that the number does not mean anything, really, and that the software is just a tool. Even Turnitin has started to speak of itself as a text-matching software in some instances. I suggested to one of the Turnitin top brass that they ditch the number and focus on what their system does best: find matching text strings (and not plagiarism!). Turnitin has just recently been acquired by a venture capital company, so they have some money to invest in making the product better. I hope that the focus will be on the usability and the reports and not on suggesting that they find more plagiarism. The decision as to whether something is to be considered plagiarism or not must rest firmly with the instructor and the institution, not with a software package.

Jonathan Bailey has blogged extensively on Day One - Day Two - Day Three of the conference. 

Tuesday, June 17, 2014

6IIPC - Pre-conference

[I wanted to blog directly from the Sixth International Integrity and Plagiarism Conference in Newcastle, but Google cooked up a cool new idea to try and force me to give them my telephone number. Since I was logging in from a different location and a different computer (duh), I needed to give them my telephone number in order to get at my account. I could not reset the password until 3 days after I last logged in. I have one account for email (which I had just checked before) and one for blogging that I wanted to switch to. It took an email to Google to get me to a place where I could say: yes, that was me, which is what I would say if I was breaking in, too.]

Day 1: Starting out
The pre-conference workshop was about creating a plagiarism policy.

  • Randa Al-Chidiac, from the Holy Spirit University of Kaslik, Lebanon, is a librarian who spoke about instituting a plagiarism policy at a school that does not currently have one. She made it clear that it is vital to have upper management be committed to the project. A school needs to define for itself what is acceptable academic practice and what sorts of sanctions they plan on meting out. The process has to be understandable, transparent and fair. She suggests that the libraries take charge of the situation, as they are publishing the theses that are potentially found to contain plagiarism. They need to speak with faculty, and develop flow charts for faculty to follow. Only as a last step should technology be introduced.
  • Loc Pham Quoc, from the Hoa Sen University in Hồ Chí Minh City, Vietnam is the dean of the Faculty of Languages and Cultural Studies. The university was the first one in Vietnam to take action for promoting academic integrity, so they were able to get media interest aroused. He set up a club called FACE (For A Clean Education) and has involved the students (and their parents and potential employers) in many activities from public discussions to designing posters for promoting academic integrity to training 20 students as communicators that help their fellow students understand how to avoid plagiarism. He noted that plagiarism is NOT rooted in culture, as the dominant culture can change. Vietnam has had many cultures: Chinese, French, American. And now they are beginning to take action to promote a culture of good academic practices.
  • Wole Morenikeji, from the Federal University of Technology, Minna, Nigeria, spoke about the program instituted at his university. They appear to rely on plagiarism detection software and have rules about the “amount” of “similarity” that can be tolerated. This sparked quite a debate about the (mis-)use of these numbers generated from plagiarism detection software for the purpose of sanctioning.
  • After lunch two people from Turnitin spoke, and rather repeated what Randa Al-Chidiac said, so there is no need to repeat it here. Then there was a discussion panel that brought up some important points. Accreditation boards have actually made universities change rulings that they have previously made in academic integrity cases. And one vice-chancellor from Africa noted that his university does not have Internet, because they do not have stable electricity. His students write their papers with their mobile phones, but the teachers have no Internet for checking them. He is currently installing solar panels in the hopes of soon being able to have 24/7 Internet available.
  • The first keynote was from Adrian Slater who was discussing collaboration and group work and plagiarism. He tried an exercise to get the room active, and people did participate, but since there was no time to really discuss the issues and there were so many questions to look at, the whole session remains a muddle in my mind.
  • After a nice reception a group broke out to go to a local bar that is on the sixth floor and completely in glass. We had a marvelous view of the Tyne and the town while having a beer and discussing paper mills and ghostwriting. There was not a lot in the way of solutions found, but it was good to commiserate and to hear that others are also grappling with this problem for which there is no software-based solution.

Thursday, June 12, 2014

Serbian Soap-Opera

The Serbian blog "Balkanist" posted an English-language version of a text that appeared in Peščanik, documenting plagiarism found in the dissertation of Minister of Internal Affairs, Nebojša Stefanović.
In Serbia, many have obtained degrees of higher education through bribery or other connections. Such degrees are particularly popular amongst politicians and party officials. Therefore, as citizens as well as academics, the three of us have decided to conduct a detailed analysis of some possibly problematic degrees earned by some of Serbia’s top public figures. We’ll start with the most glaring and recent example – Dr Nebojša Stefanović, former President of the National Assembly of Serbia and now Minister of Internal Affairs.
The authors wondered how Stefanović was able to write a dissertation in a short time while president of the national assembly. They investigated – and found plagiarism. Lots of it.

Premier Aleksandar Vučić rushed to his defense, according to the Austrian daily newspaper Die Presse. The accusations were the most stupid ones he had ever heard. And since he himself has a doctorate with the best grade in law, he was in a position to judge. Stefanović's advisor, Mića Jovanović who is Rector of the private Megatrend University in Belgrade, also chimed in in defense of Stefanović. [This university, by the way, has been accused of being a diploma mill, as the New York Times reported, and has also awarded an honorary doctorate to dictator Muammar Gaddafi.]

Peščanik quickly suffered a "hacker attack," Die Presse notes [more probably, their web server was not able to survive the number of requests from the curious]. When they returned online, they reported that they wanted to investigate one of Jovanović's two dissertations, the one supposedly granted from the University of London. Unfortunately, the thesis cannot be found. They even asked his advisor, Stephan Wood, how they could obtain a copy. Wood stated that Jovanović had studied at the London School of Economics and submitted a dissertation, but that the viva had been failed and thus the thesis had been returned to the author for major revisions. A revised version was never submitted, and thus Jovanović does not have a doctoral degree from the LSE.

Jovanović called this all a pack of lies, and has posted a picture of the title page of his dissertation on a web page of the university. Balkanist notes, according to Die Presse, that this only proves that he wrote a dissertation, not that he has a doctorate. Both dissertations, the one purportedly from LSE and the second one, submitted to the Slovenian University of Maribor, are essentially on the same topic. They offer, should he have lost his doctoral diploma, to obtain a new copy at their own expense. Jovanović need only give them a power of attorney.

Monday, June 9, 2014

Dissertation mining

The past few weeks have certainly been quite stressful for the medical school of the University of Münster in Germany. VroniPlag Wiki began reporting on plagiarism in 21 dissertations to date that were submitted to the school in the years 2004 – 2011. The findings even include a chain of three plagiarized dissertations: Gt (2010) is a plagiarism of Ckr (2009) on 100% of the pages. The Ckr thesis contains plagiarism on 94% of the pages, including Gb (2008), which in turn is a plagiarism of a thesis submitted 2007. All four theses were prepared with the same doctoral advisor. Another cluster of five theses that repeat material from each other, with another advisor, has been documented (Tmm/40 pages/47%; Aeh/15 pages/86%; Clm/27 pages/62%; Clg/21 pages/80%; Amh/21 pages/52%).  In addition to the plagiarism, evidence of data falsification has also been found in some of the theses.

How were these theses identified? And why were so many found in such a short time?

It was a rather simple application of data mining techniques to dissertations that are available as open access digital publications from university libraries. Medical dissertations were chosen, as there are a large number of them available and they often deal with similar topics. Many theses in the past 10 years are available as open access publications from the university libraries. The theses are also often painfully short, sometimes even consisting of just one publication by a research group that one of the authors submitted as their dissertation.

Volker Rieble, a German law professor, discussed open access repositories in his 2010 book Das Wissenschaftsplagiat: Vom Versagen eines Systems (p. 52ff). The book has unfortunately been taken off the market, as one of the persons named as a plagiarist won a lawsuit filed against Rieble. He argues that open access repositories, especially ones operated by universities, should be taking measures to make sure that their authors are not being plagiarized if their texts are being openly offered. He feels that this publicly available material is a simple invitation to plagiarize. Of course, he does recognize that open access could help discover plagiarism, but he pointed out that no one was taking any action against copyists.

Well, now someone has. The work of VroniPlag Wiki in the past three years has shown that there is extensive plagiarism in dissertations and other academic texts throughout Germany, in all fields, and done in many different ways. The cases in Münster were discovered using a collusion identification method applied to open access dissertations.

While reading an early version of my book, one of the VroniPlag Wiki researchers stumbled over the section on collusion. What exactly was that? Collusion is when two or more students cooperate in producing materials in situations in which they were expected to work alone. For example, two students write a program together and each turns it in as their own work. Or five students in a very large course cooperate to write a paper together and then each turns in his or her own slightly modified version. Students hope that the teachers will not be reading carefully (or not at all?) and thus will not identify the "work-saving" efforts. It is not necessary for the participants to knowingly participate in the collusion. If author A re-uses text from author B without B being aware of the situation, this would also be considered collusion.

The researcher noted that he could imagine students doing such a thing, but no doctoral candidate would be so careless as to do something like that, would they, especially when they plan on publishing online? Would people collaborate on a dissertation, each submitting their own copy, or each writing half of the dissertation, or would someone copy another dissertation from the same school or even the same professor? Unimaginable. But there was a precedent.

There was a case of collusion discovered at the medical school in Münster in 2011 that was identified by a Wikipedia author who stumbled upon two practically identical dissertations that were submitted three years apart – to the same examiners ([1] submitted in 2009 and since withdrawn, is a copy of [2] from 2006). This was found just after Germany was rocked by the Minister of Defense, Karl-Theodor zu Guttenberg, stepping down after his dissertation was found to be extensively plagiarized.  The dean of the medical school in Münster emphasized then in a press release that plagiarism in a dissertation was an absolute singularity. He also noted that they would be looking into punishing the advisor, perhaps by barring him from taking on doctoral students in the future.

Would it be possible to check whether it is indeed true that such a plagiarism is a singularity? After all, many theses are, indeed, available online. All a dishonest author would need to do would be to download one or more theses, touch them up, and submit them. Since they are apparently not read closely (or why is such a thesis acceptable in Münster? The formatting from PDF page 14 is so erratic as to make the text unreadable) this might seem a good strategy for someone who is trying to get that "Dr." with as little effort as possible.

Intra-University Clusters
The first step in identifying collusion within a university department is to obtain a good number of theses from a university and then check each one against all the others from the same school. A list of the dissertation-granting medical schools in Germany was quickly found online. An attempt was made to download medical theses for a selection of these schools, including Münster.

As is usual for data mining applications, the most time-consuming part of the exercise is getting the data ready for work. The university libraries' offerings of digital publications are of quite varying quality. Some offer wonderfully clean metadata with URLs to the entire thesis; others have chaotic catalogs, upload the same dissertation more than once under different names, or for some reason split a thesis into chapters. The names of the files are quite amusing, as they appear to be named by the candidates themselves: "copyshop-fassung.pdf" [copyshop version], "dissertation_finish.pdf", or just "doktor.pdf". Most are called "dissertation.pdf" or "doktorarbeit.pdf".

Since the main piece of software compares each thesis with all the others, the number of comparisons grows quadratically with the number of texts examined. Comparing only a few dissertations with each other only takes minutes, but as the number of dissertations examined increases, the time quickly grows to days or even months.
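To get a feeling for that growth, here is a small back-of-the-envelope sketch in Python. The function names and the figure of two seconds per comparison are my own assumptions for illustration, not measurements from the actual runs. With n theses there are n(n-1)/2 unordered pairs, so 1,000 theses already mean 499,500 comparisons.

```python
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of unordered pairs among n theses: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def estimated_hours(n: int, seconds_per_pair: float = 2.0) -> float:
    """Rough run time in hours; seconds_per_pair is an assumed figure."""
    return pair_count(n) * seconds_per_pair / 3600


# Enumerating the pairs for a tiny collection (the labels are hypothetical).
theses = ["A", "B", "C", "D"]
pairs = list(combinations(theses, 2))

# Print collection size, number of pairs, and the estimated hours.
for n in (10, 100, 1000, 10000):
    print(n, pair_count(n), round(estimated_hours(n), 1))
```

Ten times as many theses means roughly a hundred times as many comparisons, which is why a run that is comfortable for one department becomes painful for a whole collection.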

The results of the comparisons are not an automatic plagiarism determination: only identical text sequences are identified. Each and every suspicious pair of theses needs to be investigated manually. Often, both authors identified their thesis as joint work or the text is a direct quote, so this is not a plagiarism. Or both had a copy of the same questionnaire in the appendix and a very similar literature list that is responsible for the text similarity. Or two copies of the thesis were uploaded to the library database under different names. But occasionally, there is no such explanation for the numerous and at times extensive swaths of identical text. And so, researchers with VroniPlag Wiki began to document the theses – manually.
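The idea of flagging identical text sequences can be sketched in a few lines of Python. This is not the software that was actually used. It is a minimal illustration that assumes a simple approach: count the runs of five consecutive words that two texts share and rank the pairs by that overlap. The names `shingles`, `overlap_score` and `rank_pairs` and the sample texts are made up for the example.

```python
import re


def shingles(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Return the set of all runs of n consecutive words in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_score(a: str, b: str, n: int = 5) -> float:
    """Share of the shorter text's word runs that also occur in the other."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / min(len(sa), len(sb))


def rank_pairs(texts: dict[str, str], n: int = 5) -> list[tuple[float, str, str]]:
    """Score every unordered pair and sort by overlap, highest first."""
    names = sorted(texts)
    scored = [
        (overlap_score(texts[x], texts[y], n), x, y)
        for i, x in enumerate(names)
        for y in names[i + 1:]
    ]
    return sorted(scored, reverse=True)


# Hypothetical sample texts, only to show how the functions are called.
samples = {
    "thesis_1": "the patient group was examined after six weeks of treatment",
    "thesis_2": "in our study the patient group was examined after six weeks",
    "thesis_3": "completely different words appear in this unrelated short sample text",
}
for score, x, y in rank_pairs(samples):
    print(f"{x} / {y}: {score:.2f}")
```

A high score only says that two files share a lot of text. A shared questionnaire or a duplicate upload scores just as high as a copied chapter, so the manual check remains the decisive step.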

Manual documentation of plagiarism involves locating the text overlap positions, recording the overlap, and having a second researcher sign off on the documentation. Once a potential source for a thesis has been located, the text comparison tool SIM_TEXT, which researchers at VroniPlag Wiki implemented so that it runs locally in the browser, can be used to identify the positions of the text overlap. These are documented as fragments, recording the page and line numbers, and documenting the portion of text similarity in both the source and the potential plagiarism.
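For readers who want to try locating overlap positions themselves, here is a rough stand-in using `difflib.SequenceMatcher` from the Python standard library. This is not how SIM_TEXT works internally. It is only a sketch of the general idea, and the function name `matching_runs`, the minimum run length of four words and the two sample sentences are my own assumptions.

```python
from difflib import SequenceMatcher


def matching_runs(source: str, suspect: str, min_words: int = 4):
    """Yield (source_pos, suspect_pos, words) for each shared run of words."""
    a, b = source.split(), suspect.split()
    # autojunk=False keeps frequent words from being ignored in long texts.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for m in matcher.get_matching_blocks():
        # The final block always has size 0 and is filtered out here.
        if m.size >= min_words:
            yield m.a, m.b, " ".join(a[m.a:m.a + m.size])


# Hypothetical sentences, not taken from any documented thesis.
source = "we found that the mortality rate increased in the treated group over time"
suspect = "as expected the mortality rate increased in the treated group during follow up"
for src_pos, sus_pos, words in matching_runs(source, suspect):
    print(src_pos, sus_pos, words)
```

The positions are word offsets, which would still have to be mapped to page and line numbers for a fragment. Note that `get_matching_blocks` reports matches in reading order only, so passages that were moved to a different place in the copy can be missed by this sketch.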

The result lists from the comparisons can be sorted by amount of text overlap, so that one can work down from the most extensive ones. In the investigation of Münster, quite a number of theses turned up that could be documented rapidly, as the theses were quite short and the text copying was often page-wise.

The University of Münster has set up an investigative committee that includes external experts for what the press speaker has termed a "conflagration" (Flächenbrand). The committee is to convene in July. The dean is quoted in the press as being extremely irritated by the number of cases documented. The head of the medical association of Westfalen-Lippe is quoted in the same article as stating that since it is expensive to train doctors and they are urgently needed, it would be a "waste of labor" to demand that medical students spend two to three years working on a dissertation. I respectfully request, then, that medical students just quit producing sham dissertations. They should be awarded an "M.D." upon finishing their studies and let those interested in furthering science and academics invest their labors in producing dissertations that are original work.
Training to become a physician is expensive. Physicians are urgently needed. It would be a "waste of labor" to demand that a student spend two or three years working on a doctoral thesis, as is usual in other fields.

Münster - Münstersche Zeitung - Read more at:
http://www.muensterschezeitung.de/staedte/muenster/48143-M%FCnster~/Kammerpraesident-ueber-Plagiate-in-der-Medizin-Mogel-Aerzte-muessen-nicht-mit-Strafen-rechnen;art993,2384095#plx1248430946

There is also an interesting collection of statistics on dissertations in Münster put together by a VroniPlag Wiki researcher in an attempt to understand what may have caused this extreme cluster of plagiarism. What one sees here, though, is that the number of dissertations submitted has declined, as has the number of online publications.

Münster is not the only university that has been shown to have accepted massive plagiarisms. A thesis from the Charité in Berlin was recently posted (Ali) that has more than 75% plagiarism on all (100%) of the pages. It is also evident that data was falsified in this thesis: the numbers of patients interviewed differ from those in the older thesis, but the percentages given are the same as in the older thesis and do not match the numbers published in the thesis itself. When Spiegel-Online questioned the doctoral advisor about the thesis, he could only vaguely remember it. The Charité is currently investigating.

Inter-University Clusters
After investigating theses from just one university, clusters from two or more different universities can be combined in order to see whether there has been any "borrowing" of text between the universities. This is an extremely time-consuming process, but it turns up fascinating results. Two theses have been found that are around three-quarters identical that were handed in within a few weeks of each other to two different universities under different advisors. If this was joint work, it is not mentioned in either thesis. There are quite a number of theses that are patchwork quilts of text from different universities. There is a 30-page thesis submitted to Mainz (Tz), of which over half of the pages are from a thesis submitted seven years prior to Gießen.

There are so many text identities that have been found, it would take an enormous effort to document them all. But it has been shown that it is possible, using a rather simple (if time-consuming) method, to detect collusion plagiarism. Universities that publish epubs should at least make sure that they are not re-publishing material before they put a text out in public. After checking against their own text collection, perhaps a test against a selection of other university libraries is worth the investment of time. And at the risk of sounding like a broken record: the examiners should actually read the theses and perhaps keep better track of their students and the topics they pose.

The next blog article will be quite technical and explain the methodology used to find these collusion plagiarisms.

P.S. While finishing up writing this blog post today, medical dissertation #22 from Münster was posted, Aaf. The 48% of the 31 pages that have text overlap appear to be taken from a thesis submitted one year previously. The text has been disguised by substituting synonyms and re-wording sentences. This makes it difficult for software to identify the thesis as a possible plagiarism, although there are some longish portions that are taken verbatim. Page 13 shows a problem that appears when plagiarized text is rephrased: the original author writes that S. aureus appears to increase the mortality rate. The word "appears" was left out of the reworded text in Aaf, making the increase read as a known fact.

Sunday, May 4, 2014

Test of the Picapedia System

Stefan Weber noted recently that the plagiarism research group in Weimar has tweaked their experimental system picapica. We tested the system back in 2007 while it was in a beta stage. We have looked at it occasionally in recent years, but were not able to test it for various reasons. The system back in 2007 took quite a long time to return very little in the way of finding sources.

They have now focused the system on comparing a text with the current version of the Wikipedia in 10 different languages: English, German, Spanish, Catalan, Basque, French, Italian, Dutch, Portuguese, and Swedish. That would make it a useful tool for teachers looking for Wikipedia copies in student papers, as students will tend to take the current version of an article and not a historic one.

I decided to test the system using the 2013 test cases that were based on the Wikipedia, as well as a translation from a Wikipedia, an original work, and a plagiarism with no Wikipedia sources. Both .pdf and .doc files were used. The results:
  • 21-Tibet (3 sources including WP:DE): WP source found, and another Wikipedia article with an identical phrase
  • 33-Eyjafjallajoekull (1 source, WP:EN): WP source found
  • 36-Champagne (translation from WP:FR): nothing found
  • 43-Brüder-Grimm (2 sources, one WP:DE): WP source found
  • 45-Strelitzia (highly disguised plagiarism from WP:EN): WP source found
  • 46-Thermoskanne (25% automatically disguised plagiarism from WP:EN): WP source found
  • 47-Tessellation (4 sources, none WP): no sources found
  • 50-Union-Jack (original paper with correct references to WP:DE): It notes a similarity to the Union Jack lemma, but does not flag any reused text.
  • 51-London-Blitz (4 sources, one from WP:EN): WP source found
  • 52-Boxer-Rebellion (disguised plagiarism from WP:DE): WP source found
  • 57-Fallingwater (1 source, WP:DE): 2 possible WP sources named, one is correct
  • 58-Phillip-K-Dick (disguised plagiarism from WP:DE): WP source found
  • 60-Rolltreppe (3 sources, 1 WP:EN properly referenced): It notes a similarity to the Escalator lemma, but does not flag any reused text.
  • 63-Hebrew-Plag (1 source, WP:HE): nothing found, this language is not given as a possible one
The system did not always find the total amount of plagiarism, but it pointed to the correct source in all cases except the impossible ones (#36, #47, #63). It also did not report plagiarism for correctly quoted Wikipedia, something many systems do not get right.

The text is first uploaded to their server (in Germany) and deleted after examination, according to their privacy policy. However, they do keep search results and thus may have portions of the uploaded text stored in some form. I repeated the test a day later and saw no trace of the results from the previous day influencing the repeat tests. They do record IP addresses and use Google Analytics, but offer the service free of charge (even for commercial use) as long as it is not abused and as long as the user does not pretend that they developed the system themselves.

So the system does appear to be quite useful for a small subset of plagiarism detection problems, namely identifying text that has been taken from a current Wikipedia. If it is necessary to look for text in older versions of Wikipedia articles, the tool WikiBlame can be quite useful for identifying and dating text taken from the Wikipedia many years prior.
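
For readers curious how a comparison against a Wikipedia article can work in principle, here is a minimal sketch in Python. It is not the method the tested system uses, which I do not know; it only illustrates the common idea of looking for runs of identical words. The function names and the window size of five words are my own assumptions for illustration.

```python
import re


def word_ngrams(text, n=5):
    """Return the set of all runs of n consecutive words in text (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_passages(suspect, source, n=5):
    """Return the n-word runs that occur in both texts, sorted alphabetically.

    Assumption: an identical run of n words is worth a closer look.
    It is a hint for a human reader, not proof of plagiarism.
    """
    common = word_ngrams(suspect, n) & word_ngrams(source, n)
    return sorted(" ".join(gram) for gram in common)


if __name__ == "__main__":
    # Made-up example sentences, not taken from any real test case.
    suspect = ("The escalator is a moving staircase that carries "
               "people between floors of a building.")
    source = ("An escalator is a moving staircase that carries "
              "people between floors.")
    for passage in shared_passages(suspect, source):
        print(passage)
```

A sketch like this only finds verbatim overlap with the text it is given. It will miss translations and heavily disguised copies, and the number of matches says nothing reliable about how much of a text is plagiarized.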

Saturday, May 3, 2014

Stormy Waters

It all began with a Facebook posting on April 22, 2014: Arne Janning posted a longish article to his friends asking for help. He had found a recent book by two prominent historians (Karsten, A. & Rader, O. B. (2013) Grosse Seeschlachten -- Wendepunkte der Weltgeschichte von Salamis bis Skagerrak. München: C.H. Beck) to contain plagiarism from the Wikipedia. He exaggerated by saying that "every page contained plagiarism", and wondered what he should do.

The first thing Janning should have done was perhaps to check his privacy settings, as his post was public and the case quickly caught fire and was widely reported on. Maritime puns seem to be the norm for the titles of the articles, so I have chosen one as well: Spiegel Online ["Abschreiben bei Wikipedia: Zwei Historiker geraten in Plagiatssturm"], Neue Zürcher Zeitung ["Seeschlacht mit unzulässigen Beibooten"], Süddeutsche ["Wendepunkte der Weltgeschichte aus Wikipedia kopiert"], FAZ ["Unter der Flagge Wikipedias"]. The authors and the publisher promptly threatened Janning with legal action. According to Spiegel Online, one of the authors, Rader, noted that he did not actually steal intellectual property, as he only used "technical details" from the Wikipedia. "In earlier days we used the Brockhaus [encyclopedia], today we use the Wikipedia," he is quoted as stating [translation dww].

The blog Erbloggtes noted that at least two pictures had been taken from the Wikipedia in addition to some text, and that the pictures were printed without attribution. One of the pictures was indeed in the public domain, but the other was not, so its use is a definite copyright infringement. Many other blogs joined the discussion: Archivalia, Schmalenstroer, hellojed, plagiatsgutachter. The Wikipedia-Kurier discussion was, as so often, extensive.

The publisher soon decided to withdraw the book, as reported by BuchMarkt, Meedia, and others. Beck ran the book through plagiarism detection software (iThenticate) and declared the parts written by Arne Karsten to be "free from unmarked quotations", despite the fact that it is impossible to prove the absence of plagiarism. One can only demonstrate the presence of plagiarism by a synoptic documentation showing the plagiarism and the source together. The other author, however, had plagiarized not only from the Wikipedia, but also from an article published online in 2003. The publisher notes in a pseudo-scientific manner the "exact" word counts and percentages found, although I have repeatedly shown in my work (for example, my 2013 test) that such numbers are meaningless. Additionally, a reader cannot tell which parts of this book were written by which author, so they both are responsible for the entire book, in my opinion.

Beck also couldn't resist bashing Janning, still threatening legal action, perhaps to deflect criticism from itself for not having properly edited the book. A good comment by Jörg Hopfgarten in the Boersenblatt notes that the publisher would do better to understand that this was just an angry customer blowing off steam. Customers have a right to do just that without consulting a lawyer, especially when it can easily be seen that they are at least partially right. Amazon is full of similar reactions. In this case the media simply picked up on the keyword "plagiarism" and ran with it, without having independently verified the accusations. Indeed, none of the Seeschlachten books in the Berlin libraries were out on loan when I obtained a copy, although perhaps they all purchased the Kindle version.

Beck closes their press notice with a condescending offer to "participate in a discussion about the use of the Wikipedia in academics." Jan Engelmann notes on the Wikimedia blog that the discussion is for all practical purposes already over, as there are numerous court rulings on the legality of the Creative Commons license that the Wikipedia articles are under, CC-BY-SA. It is, perhaps, time for publishers to understand how a legal use of Wikipedia texts works: link to the license and the authors, and put the material that uses a Wikipedia text under at least this license. Open licenses do not mean that the material is free to be misappropriated.

Many of the blogs discussing the topic have started documenting the plagiarized portions, in particular using a German system, Picapica (also called Picapedia), that compares text to the current version of the Wikipedia. I will publish a short test of this system in my next blog entry.