
User talk:The Earwig



Copyvio Detector and Google


Hi,

(Sorry if this is the wrong forum for asking, but if so, perhaps you could point me in the right direction?)

I use the Copyvio Detector (great tool, BTW!) in checking new AfC drafts, at least a dozen times most days. I sometimes get an error message saying that the detector has exceeded its maximum allowed Google searches. This issue has always been there, occasionally, but in the last week or two it has occurred daily. When I start reviewing, around 6am or so UK time, the first few reviews always hit this problem. Then, maybe 8am (?) the daily quota probably gets reset, or something else happens, because from then onwards everything is fine until the next morning.

So I was thinking, I don't suppose there's much we can do to increase the quota (?), but would it be possible to add another search engine as a fallback option? Either so that when the user gets that error message, they could manually tick a box to use Bing (say) instead; or maybe the Detector could automatically switch to using the alternative if Google has failed.

I realise this may not be possible, either for technical or policy reasons, but thought I'd ask at least. Cheers, -- DoubleGrazing (talk) 09:35, 8 May 2024 (UTC)

Hi DoubleGrazing, using Bing or some other engine as a fallback is definitely something we’ve discussed; I hadn’t realized the issue had gotten this bad recently. The main issue here is that these services usually cost money, and while the WMF pays for our Google access right now, I don’t know if I will be able to ask for access to additional search engines. First, I can take a deeper look into whether anyone is overusing their share of the tool’s resources; we might need to block/limit them. (Our plan with Google allows about 1500 articles to be checked per day.) -- The Earwig alt (talk) 16:11, 8 May 2024 (UTC)
Okay, thanks for shedding some more light on this; needless to say, I knew nothing about how these things work.
I guess we at AfC are taking up quite a chunk of that quota, given that we see what are by definition new drafts usually by new users. I for one run the check probably at least on ⅓ of the drafts I review (and if you think that makes me an overuser, feel absolutely free to point this out, of course!). Even at NPP we deal with relatively more experienced users, so there's that much less of a need to check for CV.
It may be that I see the problem worse than some others, mind, because of my weird early-morning AfC habit, combined with the time zone I'm in. -- DoubleGrazing (talk) 17:05, 8 May 2024 (UTC)
Hi again,
Quick update on this: the problem (of the copyvio detector running out of Google quota) has lately become worse. Unlike before, when it would only manifest in the early morning UK time and usually be fine after 8am UK / 0700 UTC, it's now also happening in the afternoon. This is relatively new, maybe in the past week or two, so I don't yet have a good feel for what time it happens exactly (in case that matters); I would have said late afternoon, but e.g. today it started already around 1pm UK / 1200 UTC.
Best, -- DoubleGrazing (talk) 12:35, 4 July 2024 (UTC)
Sorry for taking a while to get back, but I'm actively working on an improvement for this now. -- The Earwig (talk) 06:43, 19 July 2024 (UTC)
Great to hear, thanks. :) DoubleGrazing (talk) 10:35, 19 July 2024 (UTC)
Do we really still have the same quota we've had for months? (or years?) As in, are we sure it hasn't been reduced? I haven't had a copyvio check go through with the search engine box checked in what seems like weeks. I can't imagine there are suddenly so many new page patrollers that it's making that much of a difference, but... -- asilvering (talk) 22:45, 23 August 2024 (UTC)
Oh. But what has really taken off in the last several months is AI. Never mind. I think I've answered my own question. Ugh. -- asilvering (talk) 22:47, 23 August 2024 (UTC)
I think we were discussing this on WP:VPWMF a few weeks ago, and the idea of making everyone log in using OAuth came up. If bots are indeed the problem, I think this is a good idea to try. –Novem Linguae (talk) 23:06, 23 August 2024 (UTC)
Yes, we're actively working on this. -- The Earwig (talk) 00:09, 24 August 2024 (UTC)
Thanks, and good luck! -- asilvering (talk) 00:26, 24 August 2024 (UTC)
Hey DoubleGrazing and asilvering. With substantial help from Chlod, we've released a change to require logging in to use the search engine option in the tool. (It uses OAuth, and it should redirect you automatically when running a new check.) This is still new, but it looks like this has eased our usage enough that the tool should not run out of quota so often. -- The Earwig (talk) 15:20, 5 October 2024 (UTC)
Brilliant, thanks so much. -- asilvering (talk) 17:47, 5 October 2024 (UTC)
Sounds good, thanks! Already tried it and seems to work well. Glad to hear it's taking some of the pressure off the quota. Cheers, -- DoubleGrazing (talk) 19:07, 5 October 2024 (UTC)

Copyright violation tool

Hello, The Earwig,

I regularly used this tool you created, mostly when patrolling drafts or CSD-tagged articles; I'd probably use it 3 or 4 times a day. When I used it too much, I'd get a message that I was over my limit of how often I could use it. At least that's how I thought things worked. Now I get this message every time I try to see whether a page is a copyright violation; I have not gotten a successful response to a query in many, many weeks. So, I'm wondering: is this "limit" actually for all users on this platform and not tied to individual editors? Because something odd is going on, and maybe new page patrollers or AFC reviewers are using it for every article they review, if I cannot get even one or two reports on suspicious articles or drafts I've come across. I know with AI, there are ways users can get around copyright restrictions but I still found the tool helpful.

Do you have any idea why it is suddenly no longer available to generate reports? Can you tell me the time of the day when it "resets" so that maybe I could make inquiries then? Or is there any possibility of raising this limit of reports generated? I mean, I'm glad it's become so popular but it has also become unavailable for use for those of us who just want to make a few queries a day. Thank you. Liz Read! Talk! 22:31, 19 July 2024 (UTC)

Hi Liz, truly sorry about the ongoing issues. I'm aware and working on it (see some of the threads above you), with the time I have available. I thought things had improved with the overall performance improvement last month, but it has really just made this particular problem of running out of the search quota much worse. Anyway, I am working on it now.
To answer your questions: yes the quota is shared by all users, and we cannot easily raise it. It's a hard limit enforced by Google that I cannot bypass without some special arrangement. It resets I think around midnight Pacific Time, i.e. Google's time zone.
I think the issue is some bots/automated traffic making too many queries. In the past I have been able to block them or ask them to slow down, but that approach has become less effective lately. So, I will be adding authentication to the tool to make sure only logged in users can use it and I can more accurately identify who is overusing it. I expect to finish that work this weekend and I am hopeful that will solve the issue. If it doesn't, there are other things I can try. -- The Earwig (talk) 00:43, 20 July 2024 (UTC)
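For anyone wanting to work out when the quota might reset in their own time zone, here is a minimal sketch. It assumes the reset really does happen at midnight Pacific Time (stated above only as "I think"), and `next_reset_utc` is a hypothetical helper name, not part of the tool.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# Assumption: the daily quota resets at midnight in Google's home time zone.
PACIFIC = ZoneInfo("America/Los_Angeles")

def next_reset_utc(now_utc):
    """Return the next midnight Pacific Time after `now_utc`, expressed in UTC."""
    local = now_utc.astimezone(PACIFIC)
    next_day = (local + timedelta(days=1)).date()
    midnight = datetime(next_day.year, next_day.month, next_day.day, tzinfo=PACIFIC)
    return midnight.astimezone(timezone.utc)

# Example: a check that fails at 12:35 UTC on 4 July 2024.
print(next_reset_utc(datetime(2024, 7, 4, 12, 35, tzinfo=timezone.utc)))
```

During daylight saving time, midnight Pacific falls at 07:00 UTC, which would be consistent with the "fine after 8am UK / 0700 UTC" pattern reported earlier on this page; in winter it shifts to 08:00 UTC.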
Update: I am still working on this, but have made progress. -- The Earwig (talk) 05:14, 22 July 2024 (UTC)
FYI, I've also run into this issue the last couple of days. I'm assuming you're still working on it, or that life has gotten in the way of you fixing the issue. I dream of horses (Hoofprints) (Neigh at me) 21:20, 30 July 2024 (UTC)
Yes, it's still my current focus with the free time I have. -- The Earwig (talk) 00:21, 31 July 2024 (UTC)
Just circling back to see how you responded to my query last month. Still have not successfully submitted a query and gotten a report in several months now. I realize that we are all volunteers so I don't have high expectations of when this issue might be "fixed" as we all have outside lives.
But I didn't realize that regular editors were competing with bots; that's a battle individual editors can never win, so please block those bots, if possible! I don't even see how a bot would be able to handle a copyright violation report and interpret it appropriately. Liz Read! Talk! 03:06, 8 August 2024 (UTC)
To second what @Liz said above, I just tried to run the copyvio tool on a promotional draft, and got the error again. Any progress to report on?
Also, Liz, I think authentication has been added so we aren't competing against bots, at least not as much, per The Earwig's comment above: "So, I will be adding authentication to the tool to make sure only logged in users can use it and I can more accurately identify who is overusing it." I dream of horses (Hoofprints) (Neigh at me) 23:48, 25 August 2024 (UTC)
Is there anything other people can do to help with getting the copyvio tool up, or is this something you're going to need to do on your own? I dream of horses (Hoofprints) (Neigh at me) 03:09, 25 September 2024 (UTC)
Hey Liz and I dream of horses. With substantial help from Chlod, we've released a change to require logging in to use the search engine option in the tool. (It uses OAuth, and it should redirect you automatically when running a new check.) This is still new, but it looks like this has eased our usage enough that the tool should not run out of quota so often. -- The Earwig (talk) 15:19, 5 October 2024 (UTC)
Great! I dream of horses (Hoofprints) (Neigh at me) 15:32, 5 October 2024 (UTC)
It works! I dream of horses (Hoofprints) (Neigh at me) 15:33, 5 October 2024 (UTC)

Earwig returns 0% on url-comparison with clever close paraphrase


Hello. I noticed a {{circular}} tag at Ceteris paribus and ran this URL comparison to find out how much duplication there was, and in what section(s). To my surprise, it came back with 0.0%. However, notice these:

Comparison snippets

From: https://www.masterclass.com/articles/ceteris-paribus-explained#7MlD3BCbNL4NC0BejpGo02

1. Supply chain: Ceteris paribus considers production factors, such as logistics, sourcing, competition, and trends with buyers to determine the price of goods. For example, a bread seller observes the costs of the ingredients, labor, packaging, and distribution, in addition to competitors, economic inflation, and consumer trends. Ceteris paribus stipulates that if other factors remain the same, a decrease in the supply of bread will cause prices to rise.

2. The law of supply and demand: In the law of demand, buyers demand less of an economic good when prices are higher. The law of supply says that sellers will supply more of an economic good when prices are higher. The interaction of these two laws determines the actual market price and volume of goods. Ceteris paribus identifies, isolates, and tests the impact of an independent variable that would affect these two laws and the causal factors in the market supply and prices.

3. Gross domestic product: Economists use ceteris paribus to study the GDP, assuming that variables remain fixed to determine the effect in the money market.

4. Interest rates: If the interest rates increase, the independent variable, then the demand for debt goes down as the cost of borrowing increases, the dependent variable.

5. Minimum wage: Economists use ceteris paribus to determine the potential effects of a minimum wage increase, including the possible outcome of fewer jobs available if companies must pay employees more.


From Ceteris paribus#Applications rev. 1238986793:

The concept of ceteris paribus is crucial for economists and can be applied in researching:

  1. Supply chain. Ceteris paribus considers aspects of production, that being competition in the market, production costs, inflation, and consumer trends to conclude pricing of goods, imposing that keeping the aspects of production constant, minimising supply will adjust prices to increase.[1]
  2. Law of supply and demand. The law of demand states that, when prices rise the demand of goods fall, whilst the law of supply dictates that as prices rise sellers are more willing to supply. When these laws interrelate market prices and supply in the market are determined. Ceteris paribus is used in the law of supply and demand through determining how independent variables will impact the casual factors of prices and supply in the market.[1]
  3. Gross domestic product. Ceteris paribus is used in relation to GDP to determine how the money market will change when variables remain constant.[1]
  4. Interest rates. Through keeping interest rates as the independent variable, as interest rates rise, thus borrowing costs rise forcing a reduction in the demand for debt, that being the dependent variable.[1]
  5. Minimum wage. To define the possible effects of a rise in the minimum wage economists will use ceteris paribus. Possible effects include how wage increases may force employments down.[1]

References

  1. ^ a b c d e "Ceteris Paribus Explained: 5 Economic Uses for Ceteris Paribus". MasterClass. 2021-12-21. Retrieved 2024-06-05.

There is a lot of close paraphrase here, maybe enough to cover their tracks and confuse the detector. I remember glancing at Andrei Broder's shingle-based detection paper eons ago (might be this one) and I don't know how yours works, but if it is shingle-based, would it be feasible to add a new param to the input form, or in the settings, maybe in an 'advanced' section, to set the shingle size? In a case of paraphrase like this one, where the information is clearly copied but words are shifted around in the sentences, a shorter shingle size might do a lot better at detecting the similarities. This might kill processing time in the web search version, so maybe would only work when the 'url' radio button was selected, but still could be pretty useful for cases like that, and might make a great tool for assigning a measurable value to close paraphrase, which afaik we do not have currently, and is all very hand-wavy. Thanks, Mathglot (talk) 19:32, 6 August 2024 (UTC)

It does slightly better (4.8%) specifying revision id 1151114395. What is going on here? Mathglot (talk) 20:09, 6 August 2024 (UTC)
Okay, just noticed that in both of those revisions, Earwig doesn't appear to see past the first short section of the web page, so the paraphrased section I am addressing doesn't appear to be visible to Earwig, or at least, it isn't displaying it on the comparison page, for some reason, if you scroll down. Mathglot (talk) 21:59, 6 August 2024 (UTC)
That's exactly it, Mathglot. The website loads its content through JavaScript so it's not available to the tool. There isn't an easy workaround for this, but there are some options I could try further in the future. Since the content doesn't show up in the comparison view as part of the source, my hope is that people will figure out what's going on, as you were able to. -- The Earwig (talk) 00:23, 7 August 2024 (UTC)
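To illustrate why JavaScript-loaded content goes missing, here is a small self-contained sketch. It is not the tool's actual extraction code: `VisibleText`, `extract_text`, and the `PAGE` markup are all hypothetical, and the example only shows that a parser reading the raw HTML never sees text a script would insert later in a browser.

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Hypothetical page: the intro is static HTML, the list is injected by a script.
PAGE = """
<html><body>
<p>Ceteris paribus explained.</p>
<div id="uses"></div>
<script>
document.getElementById("uses").textContent = "Minimum wage: economists use ceteris paribus";
</script>
</body></html>
"""

text = extract_text(PAGE)
print(text)
```

Only the static paragraph comes back; the sentence assigned inside the script never reaches the extracted text, so there is nothing for a comparison to match against. Seeing it would require actually executing the page's JavaScript, for example in a headless browser.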
Thanks for that. Even if it could see it, I wonder if it would come up with any kind of rating, due to the paraphrase? Not sure what kind of test bed you use, but if you could copy the MasterClass page and save it offline locally (post-js, or just scraping the rendered page manually and saving it) and run Earwig against that file, I'd be interested to see what it would come up with. And if you use shingling and it's parametrizable, whether the rating would change if you reduced the shingle size. Mathglot (talk) 01:14, 7 August 2024 (UTC)
OK, I can do a quick experiment of that, Mathglot. The tool does use shingling, actually. I haven't seen this paper and independently came up with a similar algorithm many years ago. Internally I call the shingle size the degree, and I've exposed that as a query-string-only parameter if you would like to play with it.
I manually copied the text to a pastebin. With the tool's default shingle size of 5 words, almost no similar text is found, and the similarity score is 5.7%. With size 3, it's 38.3%. With size 2, it's 67.1%. At this point a lot of the similar content is trivial ("is a", "in the", "of the"), so the odds of a false positive are much higher, though it does at least highlight some interesting similarities, too.
The tool doesn't have a way of identifying more unique common phrases. If we could down-weigh "is a" but up-weigh, say, "wage economists", we could lower the default shingle size and get more sensitive results. The default size was actually 3 several years ago, but I raised it because the false positive rate was just a bit too high and it was causing confusion. So there's a delicate balancing act with the current algorithm.
Food for thought. Thanks. -- The Earwig (talk) 05:20, 7 August 2024 (UTC)
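The effect of the shingle size described above can be reproduced on a tiny scale with a sketch like the following. This is not the tool's code: the tokenizer and the scoring formula (share of the article's shingles that also occur in the source) are assumptions made for illustration, and only the `degree` name and its default of 5 come from the comment above. The two sentences are the "Minimum wage" item from the snippets earlier in this thread.

```python
import re

def shingles(text, degree=5):
    """Return the set of word n-grams ("shingles") of length `degree`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + degree]) for i in range(len(words) - degree + 1)}

def similarity(article, source, degree=5):
    """Assumed score: fraction of the article's shingles also found in the source."""
    a = shingles(article, degree)
    if not a:
        return 0.0
    return len(a & shingles(source, degree)) / len(a)

source = ("Economists use ceteris paribus to determine the potential "
          "effects of a minimum wage increase")
article = ("To define the possible effects of a rise in the minimum wage "
           "economists will use ceteris paribus")

for degree in (5, 3, 2):
    print(degree, round(similarity(article, source, degree), 3))
```

On this sentence pair, no 5-word run survives the paraphrase, while shorter shingles start to match. Some of those matches are trivial ("of a") and some are meaningful ("ceteris paribus", "minimum wage"), which is the trade-off discussed above.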
Oh, that's very thought-provoking, thanks! You could start with a stop-word list, and eliminate those, and there may be lists of bigrams containing stop words. I searched /most common bi-grams with stop words in English/ and repeatedly ran into "tidytext in R", and "NLTK in Python"; also articles like 1, 2. As far as how to down-weigh and up-weigh, TF-IDF is one very standard solution, which works better on a larger corpus or bag of words, which you could accumulate yourself, by just dumping all of the words of each document you come across into a list, and counting later, maybe once a week or month, and recalculating the frequencies, but my understanding is that there is a budget available for Earwig (for the Google API) and it's likely that there is a term frequency list out there somewhere for English, and we could just buy it. (You would only have to do that once in theory, although language does evolve, so maybe once a year?) Then you wouldn't have to build your own bag of words. Your experiment looks really interesting, and I wonder if any of these other ideas would kick it up a level. Mathglot (talk) 04:05, 13 August 2024 (UTC)
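As a rough illustration of the stop-word idea, here is a minimal sketch that weights each bigram by how many of its words are not stop words, so "is a" counts for nothing and "wage economists" counts in full. It is a simple stop-word heuristic rather than TF-IDF, the `STOP_WORDS` set is a tiny made-up sample (a real list, such as the one shipped with NLTK, would be much longer), and the function names are hypothetical.

```python
import re

# Tiny illustrative stop-word list; an assumption, not a real corpus-derived list.
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "to", "and"}

def bigrams(text):
    """Return the set of adjacent word pairs in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return set(zip(words, words[1:]))

def weight(shingle):
    """Weight a shingle by the share of its words that are not stop words."""
    return sum(w not in STOP_WORDS for w in shingle) / len(shingle)

def weighted_similarity(article, source):
    """Weighted share of the article's bigrams that also occur in the source."""
    a, s = bigrams(article), bigrams(source)
    total = sum(weight(sh) for sh in a)
    if total == 0:
        return 0.0
    return sum(weight(sh) for sh in a & s) / total

print(weight(("is", "a")), weight(("wage", "economists")))
print(round(weighted_similarity("the law of demand", "the law of supply"), 3))
```

Replacing `weight` with an inverse-document-frequency lookup from a term frequency list would move this toward the TF-IDF approach suggested above, at the cost of needing that list.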
This is helpful. Thanks! -- The Earwig (talk) 13:22, 13 August 2024 (UTC)

The Signpost: 14 August 2024

* Read this Signpost in full * Single-page * Unsubscribe * MediaWiki message delivery (talk) 22:49, 14 August 2024 (UTC)

EarwigBot might be down


Hello friend. EarwigBot hasn't edited since August 17. I believe it has some daily tasks such as Wikipedia:Bots/Requests for approval/EarwigBot 3, so this is abnormal, right? It might need a nudge :) –Novem Linguae (talk) 12:50, 21 August 2024 (UTC)

Thanks for the ping! The task was active but had gotten stuck somehow. I've restarted it. — The Earwig (talk) 13:39, 21 August 2024 (UTC)
Thanks! I went ahead and boldly signed you up for a bot to alert you if it goes down again. Diff. If undesired, feel free to revert. –Novem Linguae (talk) 18:23, 21 August 2024 (UTC)
Much obliged. — The Earwig (talk) 07:18, 22 August 2024 (UTC)

Administrators' newsletter – September 2024


News and updates for administrators from the past month (August 2024).

Administrator changes

removed Pppery

Interface administrator changes

removed Pppery

Oversighter changes

removed Wugapodes

CheckUser changes

removed

Guideline and policy news

Arbitration

Miscellaneous


Sent by MediaWiki message delivery (talk) 18:45, 2 September 2024 (UTC)

The Signpost: 4 September 2024


The Signpost: 26 September 2024


Administrators' newsletter – October 2024


News and updates for administrators from the past month (September 2024).

Administrator changes

added
removed

CheckUser changes

readded
removed

Guideline and policy news

Arbitration

Miscellaneous


Sent by MediaWiki message delivery (talk) 16:01, 2 October 2024 (UTC)

Error message on Pablo Escobar


Hello Ben, I have a weird error to report: when I perform a copyvio search on Pablo Escobar, I get the error message "Access to copyvios.toolforge.org was denied, You don't have authorisation to view this page. HTTP ERROR 403". It doesn't matter what source URL I try to compare it against. However, if I try to compare using a specific revision ID of that article, it works okay. It's only occurred on Pablo Escobar (at least so far). Thought you might like to know. — Diannaa (talk) 20:32, 6 October 2024 (UTC)

Hey Diannaa, we had an unusual issue a while back where some bots/crawlers kept running checks against that page, so I disabled checks by title for it. As you noticed, the revision ID should still work. I’ll check if the bots are still hitting it and re-enable it if not. — The Earwig alt (talk) 20:37, 6 October 2024 (UTC)
OK, cool. No problem if you have to leave it, though, as there's a simple workaround: using the revision ID number. — Diannaa (talk) 20:39, 6 October 2024 (UTC)