Spell Checker Using Pyspellchecker Package

This story explains what pyspellchecker is and how to use it.

Kamal khumar
The Startup


What is pyspellchecker?

pyspellchecker is an open-source Python package that corrects spelling and lists the candidate spellings for a misspelled word.

To install the package, you can use pip:

pip install pyspellchecker

Once installed, pyspellchecker is straightforward to use. First, we import the necessary packages. Note that even though we use “pyspellchecker” when installing via pip, we type “spellchecker” in the import statement.

Fig 1: Import Statements
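A minimal text sketch of the import step might look like the following. The try/except guard is an assumption added here so that a missing installation is reported instead of raising an error; the import itself needs only the two import lines.

```python
import re  # standard library; used later to split the text into tokens

try:
    # The pip distribution is "pyspellchecker", but the module is "spellchecker".
    from spellchecker import SpellChecker
except ImportError:
    # Guard added for this sketch: report a missing install instead of crashing.
    SpellChecker = None
    print("pyspellchecker not found; run: pip install pyspellchecker")
```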

To view all the attributes and methods that the SpellChecker class exposes, Python’s built-in dir function can be used.

Fig 2: Invoking dir method
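As a rough sketch, the call could be written as below. The variable names and the filter for public names are illustrative additions, and the import guard is only there so the snippet does not fail when the package is absent.

```python
try:
    from spellchecker import SpellChecker
except ImportError:  # package missing: pip install pyspellchecker
    SpellChecker = None

if SpellChecker is not None:
    # dir() returns the names of every attribute and method on the class.
    names = dir(SpellChecker)
    print(names)

    # Optional extra step: keep only the public names (no leading underscore).
    public_names = [name for name in names if not name.startswith("_")]
    print(public_names)
```

The exact list depends on the installed version of the package, so it may not match the output shown here name for name.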

The output lists every available attribute and method:

[‘_SpellChecker__edit_distance_alt’, ‘__class__’, ‘__contains__’, ‘__delattr__’, ‘__dir__’, ‘__doc__’, ‘__eq__’, ‘__format__’, ‘__ge__’, ‘__getattribute__’, ‘__getitem__’, ‘__gt__’, ‘__hash__’, ‘__init__’, ‘__init_subclass__’, ‘__le__’, ‘__lt__’, ‘__module__’, ‘__ne__’, ‘__new__’, ‘__reduce__’, ‘__reduce_ex__’, ‘__repr__’, ‘__setattr__’, ‘__sizeof__’, ‘__slots__’, ‘__str__’, ‘__subclasshook__’, ‘_case_sensitive’, ‘_check_if_should_check’, ‘_distance’, ‘_tokenizer’, ‘_word_frequency’, ‘candidates’, ‘correction’, ‘distance’, ‘edit_distance_1’, ‘edit_distance_2’, ‘export’, ‘known’, ‘split_words’, ‘unknown’, ‘word_frequency’, ‘word_probability’]

The next step is to create a SpellChecker object, which we’ll call “spell”.

Fig 3: Creating an object
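Creating the object is a one-liner. The sketch below assumes the default settings are wanted: with no arguments the constructor loads the English dictionary and uses an edit distance of 2. The import guard is an addition of this sketch.

```python
try:
    from spellchecker import SpellChecker
except ImportError:  # package missing: pip install pyspellchecker
    SpellChecker = None

# With no arguments this is equivalent to SpellChecker(language="en", distance=2).
spell = SpellChecker() if SpellChecker is not None else None
```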

The text we are about to check is a passage about Toronto that contains a number of misspellings. It is stored as a string in the variable docx.

The string is:

People have travelled through and inhabited the Toronto area, located on a broad sloping plateau interspersed with rivers, deep ravines, and urban forest, for more than 10,000 years. After the broadly disputed Torronto Purchase, when the Mississauga surrendered the area to the British Crown,the British established the town of York in 1793 and later designeted it as the capital of Upper Canada. During the War of 1812, the town was the site of the Battle of York and suffered heavy damage by American troops. York was renamed and incorporated in 1834 as the city of Toronto. It was designated as the capitel of the province of Ontario in 1867 during Canadian Confederation. The city proper has since expanded past its original borders through both annexation and amalgamation to its current area of 630.2 km2 (243.3 sq mi). The diverse population of Tornto reflects its current and historical role as an important destination for immigrants to Canada. More than 50 percent of residants belong to a visible minority population group, and over 200 distinct ethnic origins are represented among its inhabitats. While the majority of Torontonians speak English as their premary language, over 160 languages are spoken in the city. Toront is a prominent center for music, theatre, motion picture production, and tilevision production, and is home to the headquarters of Canada’s major notional broadcast networks and media outlets. Its varied caltural institutions, which include numerous museums and gelleries, festivals and public events, entertaiment districts, national historic sites, and sports actevities, attract over 43 million touriets each year. Torunto is known for its many skysvrapers and high-rise buildinds, in particalar the tallest free-standind structure in the Western Hemisphere, the CN Tower.

All the word tokens in docx are then extracted using the re module.

Fig 4: Invoking re
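The exact pattern is not visible in the text, so the following is a sketch under an assumption: a pattern such as [a-zA-Z]+ is consistent with the token list shown, because it drops digits and punctuation and splits “Canada’s” and “high-rise” into separate tokens. The short sample string is an excerpt-style stand-in for the full docx.

```python
import re

# Shortened stand-in for the full passage stored in docx.
sample = "area of 630.2 km2 (243.3 sq mi). Canada’s high-rise"

# Keep only runs of ASCII letters; digits, punctuation and the
# typographic apostrophe act as separators.
tokens = re.findall(r"[a-zA-Z]+", sample)
print(tokens)
```

With this pattern, “km2” becomes “km” and the numbers vanish, which matches how the years and measurements are missing from the result.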

The result is:

[‘People’, ‘have’, ‘travelled’, ‘through’, ‘and’, ‘inhabited’, ‘the’, ‘Toronto’, ‘area’, ‘located’, ‘on’, ‘a’, ‘broad’, ‘sloping’, ‘plateau’, ‘interspersed’, ‘with’, ‘rivers’, ‘deep’, ‘ravines’, ‘and’, ‘urban’, ‘forest’, ‘for’, ‘more’, ‘than’, ‘years’, ‘After’, ‘the’, ‘broadly’, ‘disputed’, ‘Torronto’, ‘Purchase’, ‘when’, ‘the’, ‘Mississauga’, ‘surrendered’, ‘the’, ‘area’, ‘to’, ‘the’, ‘British’, ‘Crown’, ‘the’, ‘British’, ‘established’, ‘the’, ‘town’, ‘of’, ‘York’, ‘in’, ‘and’, ‘later’, ‘designeted’, ‘it’, ‘as’, ‘the’, ‘capital’, ‘of’, ‘Upper’, ‘Canada’, ‘During’, ‘the’, ‘War’, ‘of’, ‘the’, ‘town’, ‘was’, ‘the’, ‘site’, ‘of’, ‘the’, ‘Battle’, ‘of’, ‘York’, ‘and’, ‘suffered’, ‘heavy’, ‘damage’, ‘by’, ‘American’, ‘troops’, ‘York’, ‘was’, ‘renamed’, ‘and’, ‘incorporated’, ‘in’, ‘as’, ‘the’, ‘city’, ‘of’, ‘Toronto’, ‘It’, ‘was’, ‘designated’, ‘as’, ‘the’, ‘capitel’, ‘of’, ‘the’, ‘province’, ‘of’, ‘Ontario’, ‘in’, ‘during’, ‘Canadian’, ‘Confederation’, ‘The’, ‘city’, ‘proper’, ‘has’, ‘since’, ‘expanded’, ‘past’, ‘its’, ‘original’, ‘borders’, ‘through’, ‘both’, ‘annexation’, ‘and’, ‘amalgamation’, ‘to’, ‘its’, ‘current’, ‘area’, ‘of’, ‘km’, ‘sq’, ‘mi’, ‘The’, ‘diverse’, ‘population’, ‘of’, ‘Tornto’, ‘reflects’, ‘its’, ‘current’, ‘and’, ‘historical’, ‘role’, ‘as’, ‘an’, ‘important’, ‘destination’, ‘for’, ‘immigrants’, ‘to’, ‘Canada’, ‘More’, ‘than’, ‘percent’, ‘of’, ‘residants’, ‘belong’, ‘to’, ‘a’, ‘visible’, ‘minority’, ‘population’, ‘group’, ‘and’, ‘over’, ‘distinct’, ‘ethnic’, ‘origins’, ‘are’, ‘represented’, ‘among’, ‘its’, ‘inhabitats’, ‘While’, ‘the’, ‘majority’, ‘of’, ‘Torontonians’, ‘speak’, ‘English’, ‘as’, ‘their’, ‘premary’, ‘language’, ‘over’, ‘languages’, ‘are’, ‘spoken’, ‘in’, ‘the’, ‘city’, ‘Toront’, ‘is’, ‘a’, ‘prominent’, ‘center’, ‘for’, ‘music’, ‘theatre’, ‘motion’, ‘picture’, ‘production’, ‘and’, ‘tilevision’, ‘production’, ‘and’, ‘is’, ‘home’, ‘to’, ‘the’, ‘headquarters’, ‘of’, ‘Canada’, ‘s’, ‘major’, ‘notional’, ‘broadcast’, ‘networks’, ‘and’, 
‘media’, ‘outlets’, ‘Its’, ‘varied’, ‘caltural’, ‘institutions’, ‘which’, ‘include’, ‘numerous’, ‘museums’, ‘and’, ‘gelleries’, ‘festivals’, ‘and’, ‘public’, ‘events’, ‘entertaiment’, ‘districts’, ‘national’, ‘historic’, ‘sites’, ‘and’, ‘sports’, ‘actevities’, ‘attract’, ‘over’, ‘million’, ‘touriets’, ‘each’, ‘year’, ‘Torunto’, ‘is’, ‘known’, ‘for’, ‘its’, ‘many’, ‘skysvrapers’, ‘and’, ‘high’, ‘rise’, ‘buildinds’, ‘in’, ‘particalar’, ‘the’, ‘tallest’, ‘free’, ‘standind’, ‘structure’, ‘in’, ‘the’, ‘Western’, ‘Hemisphere’, ‘the’, ‘CN’, ‘Tower’]

The next step is to find all the misspelled words, which is done by passing the tokens extracted from docx to the “unknown” method.

Fig 5: Invoking the unknown function
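A hedged sketch of this step is shown below. It assumes the tokens come from a regular expression like [a-zA-Z]+ and uses one shortened sentence in place of the full passage; the import guard is also an addition of this sketch.

```python
import re

try:
    from spellchecker import SpellChecker
except ImportError:  # package missing: pip install pyspellchecker
    SpellChecker = None

# One shortened sentence standing in for the full passage.
docx = "The British later designeted it as the capitel of Upper Canada."
tokens = re.findall(r"[a-zA-Z]+", docx)

if SpellChecker is not None:
    spell = SpellChecker()
    # unknown() expects an iterable of words, not one long string.
    # It returns a set of the lowercased words missing from the dictionary.
    misspelled = spell.unknown(tokens)
    print(misspelled)
```

Passing the raw string instead of the token list would make the method iterate over single characters, which is why the tokenizing step comes first.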

The resulting “misspelled” collection is a set that contains, in lowercase, all the words from the string that are not in the dictionary.

Finally, we loop over “misspelled” and pass each token to the “correction” method to get its most likely spelling, and to the “candidates” method to get all the suggested spellings for that token.

Fig 6: parsing misspelt words
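The loop could be sketched roughly as follows. The three input words are taken from the passage, while the results dictionary and the import guard are illustrative additions.

```python
try:
    from spellchecker import SpellChecker
except ImportError:  # package missing: pip install pyspellchecker
    SpellChecker = None

results = {}  # illustrative: word -> (best guess, all candidates)

if SpellChecker is not None:
    spell = SpellChecker()
    misspelled = spell.unknown(["tilevision", "gelleries", "capitel"])

    for word in sorted(misspelled):
        best = spell.correction(word)     # single most likely spelling
        options = spell.candidates(word)  # all suggested spellings
        results[word] = (best, options)
        print(word, "->", best, options)
```

Depending on the installed version, a word with no known candidates (a proper noun, for example) may come back unchanged or as None, so it is worth checking for both before using the result.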

The output is:

tilevision → television {‘television’}
gelleries → galleries {‘galleries’}
particalar → particular {‘particular’}
skysvrapers → skyscrapers {‘skyscrapers’}
torontonians → torontonians {‘torontonians’}
residants → residents {‘residents’}
caltural → cultural {‘cultural’}
mississauga → mississauga {‘mississauga’}
inhabitats → inhabitants {‘inhabitants’}
standind → standing {“standin’”, ‘standing’}
actevities → activities {‘activities’}
buildinds → buildings {‘buildings’}
capitel → capital {‘capital’, ‘capitol’}
designeted → designated {‘designated’}
entertaiment → entertainment {‘entertainment’}
toront → toronto {‘toronto’}
torronto → toronto {‘toronto’}
touriets → tourists {‘tourists’}
torunto → toronto {‘toronto’}
tornto → toronto {‘toronto’, ‘tonto’}
premary → primary {‘primary’}

The program above works on English text, but the same approach can be applied to other languages such as Spanish, German, French, and Portuguese.

To illustrate, the following demo checks a few French words.

Fig 7: Demo code in French
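A sketch of the French demo is given below. It assumes the language is selected with the constructor’s language argument and the code "fr"; the input words are the four misspellings that appear in the output, and the import guard is an addition of this sketch.

```python
try:
    from spellchecker import SpellChecker
except ImportError:  # package missing: pip install pyspellchecker
    SpellChecker = None

# The four misspelled French words from the demo output.
words = ["resaissir", "plonbier", "matinnée", "tecnicien"]

# Load the French dictionary instead of the default English one.
french = SpellChecker(language="fr") if SpellChecker is not None else None

if french is not None:
    for word in sorted(french.unknown(words)):
        print(word, "->", french.correction(word), french.candidates(word))
```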

The output is:

resaissir → ressaisir {‘ressaisir’}
plonbier → plombier {‘plombier’}
matinnée → matinée {‘matinée’}
tecnicien → technicien {‘technicien’}

The full code is available on GitHub.
