Macrogrants/Wikimedia Commons Geograph and Avionics batch upload projects support
- Objective
This application seeks supporting funding for two of the largest non-GLAM projects on Wikimedia Commons. These projects are entirely driven by unpaid volunteers and have a track record of delivering huge amounts of valued content for Wikipedia in many languages.
- Goals
- Deliver 100,000 uploaded and categorized quality amateur Avionics photographs from a selection of forums, with releases from the photographers on record. Project page: Commons:Commons:Batch uploading/Airliners.
- Categorize UK Geograph images geographically (currently just under 2 million on Commons), refresh the collection with those now available in higher resolutions, and update it with additional photographs (a total current collection of 3.7 million photographs). Project page: Commons:User:Faebot/Geograph.
- Resources
- Volunteer resources. These are long-running projects spanning years rather than months, and they will require regular maintenance tasks when complete. Because the projects are highly reliant on volunteer time, the plan needs to be flexible, but firm announcements about the projects would be available for Wikimania 2014 and the main deliverables would be complete by the end of 2014. The principal contributors to the avionics project in 2013 have been Russavia and Fæ, with wide support from more than 150 other contributors. Time from the principal contributors has been of the order of more than 10 hours per week. The Geograph project took a lot of development and test time in 2012 but is now less demanding; regular set-up and maintenance is of the order of 10 hours per month of Fæ's time.
- Communications and hardware. The bandwidth costs have been high. A key reliability issue has been Fæ's internet connection, along with an inability to do any batch image processing apart from simple cropping (using an old notebook, available part time, as a Windows installation). Video processing requests (such as conversions to OGV) have been rejected in the past due to this lack of power rather than lack of volunteer time.
- The primary machine for this work is a maxed-out 2009 Mac mini running OS X 10.5. This means that Python-scripted image processing is limited or impossible. It is proposed that a dedicated machine running OS X Mountain Lion is purchased specifically to support Faebot's activities (Faebot is currently the most active bot on Commons, with a track record of over 2.2 million edits). This will provide much-needed disk capacity and open up potential for audio and video processing, as well as supporting more complex image processing and identification issues on batch uploaded images. Current price is £499[1] (standard John Lewis price with 2 year warranty).
- Bandwidth use has been high (capped at 40GB/month). An upgrade to a higher bandwidth service cost an extra £10/month; it is proposed that half of the year's costs (£60) are covered by a one-off grant in 2014 to support Faebot's batch processing related activities.
- WMUK previously agreed to pay for a £15 memory stick to reduce the likelihood of hard-drive damage, though this has yet to be claimed for. Considering that a 32GB stick will probably not be sufficient to cope with a full XML dump from Commons in 2014 (needed for the Geograph project), it is proposed that a 64GB USB stick is purchased, with expected costs of around £30.
- Staff resources. None.
- Expenses. Limited to postage, perhaps £10; no travel is expected.
- Access costs. An obstacle to uploading some sets of restricted images (where a licence release is on record in OTRS) has been that some of the forums require membership. The membership cost for Airliners.net (the main resource so far) is $55 for a year; there are options for taking 3 month or 6 month memberships that may be suitable. Around 9 forums are on our target list, and a general membership budget of up to £100 may be sufficient, to be spent as and when these purchases will have the most benefit.
The total budget to support the above is estimated at £650.
- Constraints
None.
These projects are noted for being both engaging for "gnomic" volunteers and independent of the WMF or WMF-managed tools. This probably remains desirable, even if promotion of the outcomes (the images then available for reuse on all projects) may appeal to "front-end" volunteers. The methods could be popular to present at events such as Wikimania or at more focussed workshops on how to manage large batch projects on Commons. In the longer term there may be regular maintenance or housekeeping bot tasks that could transition to WMF servers.
- Outcomes
- 100,000 Avionics photographs checked and categorized will provide an independent and non-commercial world standard reference base for aircraft of all model types in all airline liveries. A consequence will be a consistent standard for using ICAO codes on all Airport categories, along with their geo-coordinates and photographs of the majority of airports, military air bases and air fields in all countries.
- Consensus for the project methods of automatic geo-categorization of the millions of photographs in the Geograph collection. Categorization has so far been limited to UK County/Authority level because of doubts about accuracy and a lack of standardization for naming lower-level categories such as villages. An automated link using Wikidata may be possible in 2014, though this will also require cross-project consensus. This has been an issue without firm consensus for several years.
- Throughout 2014 a series of published tests, case studies and on-Commons guidelines for:
- Best practices for using Ordnance Survey Open Data to categorize images on Commons by location.
- Python and Pywikipediabot techniques for identifying and removing standard watermarks and credit bars from batch uploads. This may include the use of SciPy or similar open source tools to analyse and correct images.
- Identification of near duplicates and copyright problems through image matching.
- Using EXIF data for improved categorization and finding suspect images.
- Ad-hoc outcomes from supporting Faebot. A track record of successful batch upload and bot projects tends to attract suggestions from the Commons Community. As an example, User:odder proposed the creation of Commons:User:Noaabot to maintain daily USA weather maps on Commons (which is supported using the same machine as Faebot). Llywelyn2000 proposed a (Fair Use & PD) Welsh book covers project with a maintained project dashboard on the Welsh Wikipedia, cy:Wicipedia:Wicibrosiect_Llyfrau_Gwales/dangosfwrdd. These active projects rely on Faebot being available every day of the year.
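As one illustration of the kind of technique the location guidelines above might cover, the following is a minimal sketch that converts an Ordnance Survey National Grid reference into eastings and northings in metres. It assumes (the proposal does not say so) that each image's location is available as a grid reference such as `SH6054`. The function name `grid_ref_to_en` is hypothetical and this is not a description of how Faebot works. Turning the resulting coordinates into a Commons category would still need a separate lookup, for example one built from OS Open Data boundaries.

```python
import re


def grid_ref_to_en(grid_ref):
    """Convert an OS National Grid reference to (easting, northing) in metres.

    The result is the south-west corner of the square the reference names.
    Hypothetical helper for illustration only.
    """
    ref = grid_ref.replace(" ", "").upper()
    match = re.fullmatch(r"([A-HJ-Z]{2})(\d{0,10})", ref)
    if match is None or len(match.group(2)) % 2 != 0:
        raise ValueError(f"not a valid grid reference: {grid_ref!r}")
    letters, digits = match.groups()

    # Letter indices 0-24; the grid lettering skips 'I'.
    l1 = ord(letters[0]) - ord("A")
    l2 = ord(letters[1]) - ord("A")
    if l1 > 7:
        l1 -= 1
    if l2 > 7:
        l2 -= 1

    # The first letter picks a 500 km square, the second a 100 km square
    # inside it; combine them into 100 km indices from the grid origin.
    e100km = ((l1 - 2) % 5) * 5 + (l2 % 5)
    n100km = (19 - (l1 // 5) * 5) - (l2 // 5)
    if not (0 <= e100km <= 6 and 0 <= n100km <= 12):
        raise ValueError(f"outside the National Grid: {grid_ref!r}")

    # Pad each half to five digits so that '60' means 60000 m.
    half = len(digits) // 2
    easting = e100km * 100_000 + int(digits[:half].ljust(5, "0"))
    northing = n100km * 100_000 + int(digits[half:].ljust(5, "0"))
    return easting, northing
```

For example, `grid_ref_to_en("SH6054")` gives the corner of a 1 km square in north Wales, and a ten-digit reference such as `"TG 51409 13177"` resolves to the nearest metre.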
- Risks
- This project is highly reliant on Fæ's time and Russavia's expertise for avionics. Project pages such as Commons:Batch_uploading/Airliners actively encourage participation, and it may be possible to get another bot operator interested in the relevant maintenance scripts that Faebot relies on. However, there are no hard deadlines, so temporary illness or unavailability of a volunteer should not affect the long-term outcome.
- Discussion
- Would this be better done by an online server (or Wikimedia Labs?) rather than a computer on a home broadband connection? Thanks. Mike Peel (talk) 16:14, 15 November 2013 (UTC)
- I agree that highly stable daily routines like the non-image processing parts of Noaabot, or the weekly upload from the MOD (with on-going positive cooperation from the Ministry of Defence Imagery Team), could benefit by moving to WMF servers (both Faebot and Noaabot have been recently set up at tools.wmflabs.org). Remember these are not intended to be external facing tools, or additions to MediaWiki. Investigating this and making it happen was something I was planning on looking at in 2014, guessing it will take a month or two to pick away at it, having never done it before. There being little training for any of this, Wikimedia documentation and manuals intended for volunteers remain in a hard to use state, which means I would have to commit quite a proportion of spare time to auto-didactically working it out, diverting from pragmatic delivery. Farming stable pieces of Faebot out to the WMF servers is an expected outcome in 2014, though the successful track record of doing the creative development parts of these projects at the client end says a lot for keeping them a hackish local set-up.
- Case: Image processing and experiments: There are benefits to being able to process images and run experiments locally, including tweaking and testing code even when my connection is unreliable, as has happened many times in the past (currently this bug caused by WMF server problems has not stopped Faebot developing). Much of Faebot's activity remains rather ad-hoc and adapts as problems arise; it is the ease of adaptation that has resulted in Faebot becoming the most active bot on Commons. In fact, in terms of image processing, apart from running SciPy-based transforms, I find it hard to imagine resolving the embossed watermarks issue in a virtual environment. The process I use for cropping ended up as a mix of Windows tools and OS X tools, with other image problems even relying on work-arounds using Photoshop macros, and past attempts at video re-compression using a workflow piping the files to external sites and redoing them in a local VLC setup. Any of these sorts of fixes would be made so complex by going via a virtual environment (or dubious if needing to install odd one-off tools in the remote environment) that I would give up at the planning stage before experimentation started.
- Case: Avionics mappings: For the Avionics project each forum is a separate configuration using BeautifulSoup to carefully capture the image metadata. Each author has their own template, which I have had to fiddle around with to get the mapping right. The airports list extends as we go along, which means Faebot halts and tells me a mapping is needed, and I hand-update the category map (if I can find a category; otherwise I make a human decision on whether to create it or leave it for the project team) along with my best guess for string matching, which may have value for other forums. Integrating the complex process of detecting and removing embossed watermarks would be highly experimental, and I don't see any value in pushing this out to a remote server.
- Case: Geograph: For Geograph things are more stable with runs for regions (like Wales) taking a month or two to complete, but again I would separate out the uniquely mapped batch uploads, along with their heuristic category mappings, from what might become maintenance tasks for Geograph longer term. The potential next stage of mapping to sub-County level will require significant small-run testing and experimentation, again best done locally where suck-it-and-see along with a lot of invisible passive tests is an existing highly successful approach. Part of the project will be the end transition to maintenance which may well be either on WMF servers, or run elsewhere but with the code published centrally to ensure operator handover can work should the worst happen.
- --Fæ (talk) 01:11, 18 November 2013 (UTC)
- I don't know about 'an online server (or Wikimedia Labs?)', Mike, but the work Fae's doing uploading book covers on cy is superb! Robin Owain (WMUK) (talk) 01:23, 21 November 2013 (UTC)
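The "halt and ask for a mapping" step described in the Avionics mappings case above could look something like the sketch below. This is an illustration only, not Faebot's actual code: the exception `MappingNeeded`, the function `category_for`, and the sample map entry are all hypothetical, and the whitespace and case normalization stands in for whatever string matching the real scripts use.

```python
class MappingNeeded(Exception):
    """Raised to halt a run until a human extends the category map."""


def normalize(name):
    """Collapse whitespace and ignore case so trivial variants still match."""
    return " ".join(name.split()).casefold()


def category_for(airport_name, category_map):
    """Return the Commons category mapped to an airport name from a forum.

    Raises MappingNeeded rather than guessing, so that the operator can
    decide whether to map, create, or defer the category.
    """
    try:
        return category_map[normalize(airport_name)]
    except KeyError:
        raise MappingNeeded(
            f"no category mapped for airport {airport_name!r}"
        ) from None


# Hypothetical hand-maintained map; keys are stored already normalized.
category_map = {
    "london heathrow airport": "Category:London Heathrow Airport",
}
```

Raising an exception instead of falling back to a default keeps the human decision in the loop, which matches the workflow described: the run stops, the map is updated by hand, and the batch is restarted.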