How to Read Newline in .txt File
How to extract specific portions of a text file using Python
Updated: 06/30/2020 by Computer Hope
Extracting text from a file is a common task in scripting and programming, and Python makes it easy. In this guide, we'll discuss some simple ways to extract text from a file using the Python 3 programming language.
Make sure you're using Python 3
In this guide, we'll be using Python version 3. Many older systems come pre-installed with Python 2.7. While Python 2.7 is used in legacy code, Python 3 is the present and future of the Python language. Unless you have a specific reason to write or support Python 2, we recommend working in Python 3.
For Microsoft Windows, Python 3 can be downloaded from the Python official website. When installing, make sure the "Install launcher for all users" and "Add Python to PATH" options are both checked, as shown in the image below.
On Linux, you can install Python 3 with your package manager. For example, on Debian or Ubuntu, you can install it with the following command:
sudo apt-get update && sudo apt-get install python3
For macOS, the Python 3 installer can be downloaded from python.org, as linked above. If you are using the Homebrew package manager, it can also be installed by opening a terminal window (Applications → Utilities), and running this command:
brew install python3
Running Python
On Linux and macOS, the command to run the Python 3 interpreter is python3. On Windows, if you installed the launcher, the command is py. The commands on this page use python3; if you're on Windows, substitute py for python3 in all commands.
Running Python with no options starts the interactive interpreter. For more information about using the interpreter, see Python overview: using the Python interpreter. If you accidentally enter the interpreter, you can exit it using the command exit() or quit().
Running Python with a file name will interpret that Python program. For example:
python3 program.py
...runs the program contained in the file program.py.
Okay, how can we use Python to extract text from a text file?
Reading data from a text file
First, let's read a text file. Let's say we're working with a file named lorem.txt, which contains lines from the Lorem Ipsum example text.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.
Note
In all the examples that follow, we work with the four lines of text contained in this file. Copy and paste the Latin text above into a text file, and save it as lorem.txt, so you can run the example code using this file as input.
A Python program can read a text file using the built-in open() function. For example, the Python 3 program below opens lorem.txt for reading in text mode, reads the contents into a string variable named contents, closes the file, and prints the data.
myfile = open("lorem.txt", "rt")  # open lorem.txt for reading text
contents = myfile.read()          # read the entire file to string
myfile.close()                    # close the file
print(contents)                   # print string contents
Here, myfile is the name we give to our file object.
The "rt" parameter in the open() function means "we're opening this file to read text data."
The hash mark ("#") means that everything after it on that line is a comment, and it's ignored by the Python interpreter.
If you save this program in a file called read.py, you can run it with the following command.
python3 read.py
The command above outputs the contents of lorem.txt:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.
Using "with open"
It's important to close your open files as soon as possible: open the file, perform your operation, and close it. Don't leave it open for extended periods of time.
When you're working with files, it's good practice to use the with open...as compound statement. It's the cleanest way to open a file, operate on it, and close the file, all in one easy-to-read block of code. The file is automatically closed when the code block completes.
Using with open...as, we can rewrite our program to look like this:
with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading text
    contents = myfile.read()             # Read the entire file to a string
print(contents)                          # Print the string
Note
Indentation is important in Python. Python programs use white space at the beginning of a line to define scope, such as a block of code. We recommend you use four spaces per level of indentation, and that you use spaces rather than tabs. In the following examples, make sure your code is indented exactly as it's presented here.
Example
Save the program as read.py and execute it:
python3 read.py
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.
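The examples above let open() choose the platform's default text encoding. As a minimal sketch, assuming the file is saved as UTF-8, the encoding can be named explicitly with open()'s encoding parameter. The file name lorem_demo.txt is hypothetical, and the block writes that file first so it can run on its own:

```python
sample = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n"
    "Nunc fringilla arcu congue metus aliquam mollis.\n"
)

# Write a small demo file (hypothetical name) so the sketch is self-contained.
with open("lorem_demo.txt", "wt", encoding="utf-8") as myfile:
    myfile.write(sample)

# Read it back, naming the encoding instead of relying on the default.
with open("lorem_demo.txt", "rt", encoding="utf-8") as myfile:
    contents = myfile.read()

print(myfile.closed)     # the with block has already closed the file
print(contents, end="")  # contents already ends with a newline
```

Checking myfile.closed after the block is one way to confirm that with open...as really did close the file.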
Reading text files line-by-line
In the examples so far, we've been reading in the whole file at once. Reading a full file is no big deal with small files, but generally speaking, it's not a great idea. For one thing, if your file is bigger than the amount of available memory, you'll encounter an error.
In almost every case, it's a better idea to read a text file one line at a time.
In Python, the file object is an iterator. An iterator is a type of Python object which behaves in certain ways when operated on repeatedly. For instance, you can use a for loop to operate on a file object repeatedly, and each time the same operation is performed, you'll receive a different, or "next," result.
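To make the "next result" idea concrete, here is a small sketch that calls the built-in next() function by hand. It assumes an io.StringIO object as a stand-in for an open text file, so no file on disk is needed; the variable names are illustrative only:

```python
import io

# io.StringIO behaves like an open text file (a stand-in for this sketch).
myfile = io.StringIO("first line\nsecond line\n")

first = next(myfile)    # each call to next() returns the following line
second = next(myfile)
print(repr(first))
print(repr(second))

# When no lines remain, next() raises StopIteration.
# A for loop catches this signal for us and simply ends.
try:
    next(myfile)
    exhausted = False
except StopIteration:
    exhausted = True
print(exhausted)
```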
Example
For text files, the file object iterates one line of text at a time. It considers one line of text a "unit" of data, so we can use a for...in loop statement to iterate one line at a time:
with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading
    for myline in myfile:                # For each line, read to a string,
        print(myline)                    # and print the string.
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Nunc fringilla arcu congue metus aliquam mollis.

Mauris nec maximus purus. Maecenas sit amet pretium tellus.

Quisque at dignissim lacus.

Notice that we're getting an extra line break ("newline") after every line. That's because two newlines are being printed. The first one is the newline at the end of every line of our text file. The second newline happens because, by default, print() adds a linebreak of its own at the end of whatever you've asked it to print.
Let's store our lines of text in a variable (specifically, a list variable) so we can look at it more closely.
Storing text data in a variable
In Python, lists are similar to, but not the same as, an array in C or Java. A Python list contains indexed data, of varying lengths and types.
Example
mylines = []                             # Declare an empty list named mylines.
with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading text data.
    for myline in myfile:                # For each line, stored as myline,
        mylines.append(myline)           # add its contents to mylines.
print(mylines)                           # Print the list.
The output of this program is a little different. Instead of printing the contents of the list, this program prints our list object, which looks like this:
Output:
['Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n', 'Nunc fringilla arcu congue metus aliquam mollis.\n', 'Mauris nec maximus purus. Maecenas sit amet pretium tellus.\n', 'Quisque at dignissim lacus.\n']
Here, we see the raw contents of the list. In its raw object form, a list is represented as a comma-delimited list. Here, each element is represented as a string, and each newline is represented as its escape character sequence, \n.
Much like a C or Java array, the list elements are accessed by specifying an index number after the variable name, in brackets. Index numbers start at zero; in other words, the nth element of a list has the numeric index n-1.
Note
If you're wondering why the index numbers start at zero instead of one, you're not alone. Computer scientists have debated the usefulness of zero-based numbering systems in the past. In 1982, Edsger Dijkstra gave his opinion on the subject, explaining why zero-based numbering is the best way to index data in computer science. You can read the memo yourself; he makes a compelling argument.
Example
We can print the first element of mylines by specifying index number 0, contained in brackets after the name of the list:
print(mylines[0])
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Example
Or the third line, by specifying index number 2:
print(mylines[2])
Output:
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
But if we try to access an index for which there is no value, we get an error. Our list has four elements, so the highest valid index is 3:
Example
print(mylines[4])
Output:
Traceback (most recent call last):
  File <filename>, line <linenum>, in <module>
    print(mylines[4])
IndexError: list index out of range
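If a program cannot know in advance how many lines a file has, it can guard against this error. The sketch below is one possible approach, not the only one: get_line() is a hypothetical helper, and the list contents are placeholders rather than the lines of lorem.txt.

```python
mylines = ["line one", "line two", "line three", "line four"]


def get_line(lines, index):
    """Return lines[index], or None when the index is out of range."""
    try:
        return lines[index]
    except IndexError:
        return None


print(len(mylines))          # number of elements; valid indexes are 0 to len - 1
print(get_line(mylines, 3))  # last valid index
print(get_line(mylines, 4))  # out of range, so the helper returns None
```

Comparing the index against len(mylines) before indexing works equally well; catching IndexError simply keeps the check in one place.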
Example
A list object is iterable, so to print every element of the list, we can iterate over it with for...in:
mylines = []                             # Declare an empty list
with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading text.
    for line in myfile:                  # For each line of text,
        mylines.append(line)             # add that line to the list.
for element in mylines:                  # For each element in the list,
    print(element)                       # print it.
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Nunc fringilla arcu congue metus aliquam mollis.

Mauris nec maximus purus. Maecenas sit amet pretium tellus.

Quisque at dignissim lacus.

But we're still getting extra newlines. Each line of our text file ends in a newline character ('\n'), which is being printed. Also, after printing each line, print() adds a newline of its own, unless you tell it to do otherwise.
We can change this default behavior by specifying an end parameter in our print() call:
print(element, end='')
By setting end to an empty string (two single quotes, with no space), we tell print() to print nothing at the end of a line, instead of a newline character.
Example
Our revised program looks like this:
mylines = []                             # Declare an empty list
with open('lorem.txt', 'rt') as myfile:  # Open file lorem.txt
    for line in myfile:                  # For each line of text,
        mylines.append(line)             # add that line to the list.
for element in mylines:                  # For each element in the list,
    print(element, end='')               # print it without extra newlines.
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.
The newlines you see here are actually in the file; they're a special character ('\n') at the end of each line. We want to get rid of these, so we don't have to worry about them while we process the file.
How to strip newlines
To remove the newlines completely, we can strip them. To strip a string is to remove one or more characters, usually whitespace, from either the beginning or end of the string.
Tip
This process is sometimes also called "trimming."
Python 3 string objects have a method called rstrip(), which strips characters from the right side of a string. The English language reads left-to-right, so stripping from the right side removes characters from the end.
If the variable is named mystring, we can strip its right side with mystring.rstrip(chars), where chars is a string of characters to strip. For example, "123abc".rstrip("bc") returns 123a.
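One detail is easy to miss: chars is treated as a set of characters, not as a suffix. rstrip() keeps removing characters from the right for as long as each one appears in chars. The short sketch below illustrates this with made-up strings; the variable names are illustrative only.

```python
a = "123abc".rstrip("bc")    # trailing "c" and "b" are removed, "a" stops it
b = "abcbcb".rstrip("bc")    # every trailing "b" or "c" goes, in any order
c = "text\n\n".rstrip("\n")  # all trailing newlines are removed, not just one
d = "text \t\n".rstrip()     # with no argument, trailing whitespace is removed

print(repr(a))
print(repr(b))
print(repr(c))
print(repr(d))
```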
Tip
When you represent a string in your program with its literal contents, it's called a string literal. In Python (as in most programming languages), string literals are always quoted, enclosed on either side by single (') or double (") quotes. In Python, single and double quotes are equivalent; you can use one or the other, as long as they match on both ends of the string. It's traditional to represent a human-readable string (such as Hello) in double-quotes ("Hello"). If you're representing a single character (such as b), or a single special character such as the newline character (\n), it's traditional to use single quotes ('b', '\n'). For more information about how to use strings in Python, you can read the documentation of strings in Python.
The statement string.rstrip('\n') will strip any newline characters from the right side of string. The following version of our program strips the newlines when each line is read from the text file:
mylines = []                                 # Declare an empty list.
with open('lorem.txt', 'rt') as myfile:      # Open lorem.txt for reading text.
    for myline in myfile:                    # For each line in the file,
        mylines.append(myline.rstrip('\n'))  # strip newline and add to list.
for element in mylines:                      # For each element in the list,
    print(element)                           # print it.
The text is now stored in a list variable, so individual lines can be accessed by index number. Newlines were stripped, so we don't have to worry about them. We can always put them back later if we reconstruct the file and write it to disk.
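As an alternative sketch, the string method splitlines() produces a list of lines without their line endings in one step, and "\n".join() puts the newlines back. Here the variable text is a placeholder standing in for the result of myfile.read(); note that splitlines() also splits on a few other line-boundary characters (such as '\r'), which is an assumption to check against your own data.

```python
# Placeholder for the string returned by myfile.read().
text = "line one\nline two\nline three\n"

mylines = text.splitlines()          # split into lines, dropping the newlines
print(mylines)

rebuilt = "\n".join(mylines) + "\n"  # restore a newline after every line
print(rebuilt == text)
```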
Now, let's search the lines in the list for a specific substring.
Searching text for a substring
Let's say we want to locate every occurrence of a certain phrase, or even a single letter. For example, maybe we need to know where every "e" is. We can accomplish this using the string's find() method.
The list stores each line of our text as a string object. All string objects have a method, find(), which locates the first occurrence of a substring in the string.
Let's use the find() method to search for the letter "e" in the first line of our text file, which is stored in the list mylines. The first element of mylines is a string object containing the first line of the text file. This string object has a find() method.
In the parentheses of find(), we specify parameters. The first and only required parameter is the string to search for, "e". The statement mylines[0].find("e") tells the interpreter to search forward, starting at the beginning of the string, one character at a time, until it finds the letter "e." When it finds one, it stops searching, and returns the index number where that "e" is located. If it reaches the end of the string without finding one, it returns -1 to indicate nothing was found.
Example
print(mylines[0].find("e"))
Output:
3
The return value "3" tells us that the letter "e" is the fourth character, the "e" in "Lorem". (Remember, the index is zero-based: index 0 is the first character, 1 is the second, etc.)
The find() method takes two optional, additional parameters: a start index and a stop index, indicating where in the string the search should begin and end. For example, string.find("abc", 10, 20) searches for the substring "abc", but only from the 11th to the 20th character (the character at the stop index itself is not included). If stop is not specified, find() starts at index start, and stops at the end of the string.
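The effect of the start and stop parameters is easiest to see on a short string. The sketch below uses a made-up string, s, chosen so that the results can be checked by eye; it is only an illustration of how the search window is applied.

```python
s = "abcabcabc"             # "abc" starts at indexes 0, 3, and 6

r1 = s.find("abc")          # no limits: the first match, at index 0
r2 = s.find("abc", 1)       # start at index 1: the next match, at index 3
r3 = s.find("abc", 1, 5)    # window s[1:5] is "bcab": no complete match
r4 = s.find("abc", 1, 6)    # window s[1:6] is "bcabc": match at index 3

print(r1, r2, r3, r4)
```

The returned index is always relative to the whole string, not to the start of the search window.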
Example
For example, the following statement searches for "e" in mylines[0], beginning at the fifth character.
print(mylines[0].find("e", 4))
Output:
24
In other words, starting at the 5th character in mylines[0], the first "e" is located at index 24 (the "e" in "amet").
Example
To search the third line, starting at index 10 and stopping at index 30:
print(mylines[2].find("e", 10, 30))
Output:
28
(The first "e" in "Maecenas").
If find() doesn't locate the substring in the search range, it returns the number -1, indicating failure:
print(mylines[0].find("e", 25, 30))
Output:
-1
There were no "e" occurrences between indices 25 and 30.
Finding all occurrences of a substring
But what if we want to locate every occurrence of a substring, not just the first one we encounter? We can iterate over the string, starting from the index of the previous match.
In this example, we'll use a while loop to repeatedly find the letter "e". When an occurrence is found, we call find again, starting from a new location in the string. Specifically, the location of the last occurrence, plus the length of the substring (so we can move forward past the last one). When find returns -1, or the start index reaches the length of the string, we stop.
# Build array of lines from file, strip newlines
mylines = []                                 # Declare an empty list.
with open('lorem.txt', 'rt') as myfile:      # Open lorem.txt for reading text.
    for myline in myfile:                    # For each line in the file,
        mylines.append(myline.rstrip('\n'))  # strip newline and add to list.

# Locate and print all occurrences of letter "e"
substr = "e"                          # substring to search for.
for line in mylines:                  # string to be searched
    index = 0                         # current index: character being compared
    prev = 0                          # previous index: last character compared
    while index < len(line):          # While index has not exceeded string length,
        index = line.find(substr, index)  # set index to first occurrence of "e"
        if index == -1:               # If nothing was found,
            break                     # exit the while loop.
        print(" " * (index - prev) + "e", end='')  # print spaces from previous
                                                   # match, then the substring.
        prev = index + len(substr)    # remember this position for next loop.
        index += len(substr)          # increment the index by the length of substr.
                                      # (Repeat until index >= line length)
    print('\n' + line)                # Print the original string under the e's
Output:
   e                    e       e  e               e
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
                         e  e
Nunc fringilla arcu congue metus aliquam mollis.
        e                   e e          e    e      e
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
      e
Quisque at dignissim lacus.
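The same search loop can also be packaged as a function that returns the positions instead of printing them. The sketch below is one way to do that; find_all() is a hypothetical helper name, and it reports non-overlapping matches only, because each new search starts after the end of the previous match.

```python
def find_all(text, substr):
    """Return a list of every index where substr starts in text."""
    positions = []
    if not substr:                 # an empty substring would never advance
        return positions
    index = text.find(substr)
    while index != -1:             # find() returns -1 when nothing is left
        positions.append(index)
        index = text.find(substr, index + len(substr))
    return positions


line = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
print(find_all(line, "e"))   # [3, 24, 32, 35, 51]
print(find_all(line, "zz"))  # []
```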
Incorporating regular expressions
For complex searches, use regular expressions.
The Python regular expressions module is called re. To use it in your program, import the module before you use it:
import re
The re module implements regular expressions by compiling a search pattern into a pattern object. Methods of this object can then be used to perform match operations.
For example, let's say you want to search for any word in your document which starts with the letter d and ends in the letter r. We can accomplish this using the regular expression "\bd\w*r\b". What does this mean?
character sequence | meaning |
---|---|
\b | A word boundary. It matches an empty string (a position rather than a character), but only at the start or end of a word: between a word character and a non-word character, or at the edge of the string. "Word characters" are the digits 0 through 9, the lowercase and uppercase letters, or an underscore ("_"). |
d | Lowercase letter d. |
\w* | \w represents any word character, and * is a quantifier meaning "zero or more of the previous character." So \w* will match zero or more word characters. |
r | Lowercase letter r. |
\b | Word boundary. |
So this regular expression will match any string that can be described as "a word boundary, then a lowercase 'd', then zero or more word characters, then a lowercase 'r', then a word boundary." Strings described this way include the words destroyer, dour, and doctor, and the abbreviation dr.
To use this regular expression in Python search operations, we first compile it into a pattern object. For example, the following Python statement creates a pattern object named pattern which we can use to perform searches using that regular expression.
pattern = re.compile(r"\bd\w*r\b")
Note
The letter r before our string in the statement above is important. It tells Python to interpret our string as a raw string, exactly as we've typed it. If we didn't prefix the string with an r, Python would interpret the escape sequences such as \b in other ways. Whenever you need Python to interpret your strings literally, specify it as a raw string by prefixing it with r.
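A quick experiment shows what "in other ways" means for \b. In an ordinary string literal, \b is the escape sequence for a single backspace character, so the regular expression engine never sees a word boundary. The sketch below uses the simpler, made-up pattern \bdoctor\b to keep the comparison easy to follow:

```python
import re

print(len("\b"))    # 1: a single backspace character
print(len(r"\b"))   # 2: a backslash followed by the letter b

raw = re.compile(r"\bdoctor\b")      # word boundary, "doctor", word boundary
not_raw = re.compile("\bdoctor\b")   # backspace, "doctor", backspace

text = "Good morning, doctor."
raw_found = raw.search(text) is not None
not_raw_found = not_raw.search(text) is not None
print(raw_found)       # True
print(not_raw_found)   # False: the text contains no backspace characters
```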
Now we can use the pattern object's methods, such as search(), to search a string for the compiled regular expression, looking for a match. If it finds one, it returns a special result called a match object. Otherwise, it returns None, a built-in Python constant that is used like the boolean value "false".
import re
text = "Good morning, doctor."        # named text to avoid shadowing the built-in str
pat = re.compile(r"\bd\w*r\b")        # compile regex "\bd\w*r\b" to a pattern object
if pat.search(text) is not None:      # Search for the pattern. If found,
    print("Found it.")
Output:
Found it.
To perform a case-insensitive search, you can specify the special constant re.IGNORECASE in the compile step:
import re
text = "Hello, DoctoR."
pat = re.compile(r"\bd\w*r\b", re.IGNORECASE)  # upper and lowercase will match
if pat.search(text) is not None:
    print("Found it.")
Output:
Found it.
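The match object returned by search() carries more than a yes-or-no answer. As a brief sketch (the sample sentence is made up), its group(), start(), and span() methods report what was matched and where, and the pattern object's findall() method returns every match rather than only the first:

```python
import re

text = "Good morning, doctor. The DoctoR is in."
pat = re.compile(r"\bd\w*r\b", re.IGNORECASE)

match = pat.search(text)       # first match only, or None
if match is not None:
    print(match.group())       # the matched text: doctor
    print(match.start())       # index where the match begins: 14
    print(match.span())        # (start, end) pair: (14, 20)

all_matches = pat.findall(text)
print(all_matches)             # ['doctor', 'DoctoR']
```

Note that the "d" inside "Good" is not matched, because no word boundary comes before it.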
Putting it all together
So now we know how to open a file, read the lines into a list, and locate a substring in any given list element. Let's use this knowledge to build some example programs.
Print all lines containing substring
The program below reads a log file line by line. If the line contains the word "error," it is added to a list called errors. If not, it is ignored. The lower() string method converts all strings to lowercase for comparison purposes, making the search case-insensitive without altering the original strings.
Note that the find() method is called directly on the result of the lower() method; this is called method chaining. Also, note that in the errors.append() statement, we construct an output string by joining several strings with the + operator.
errors = []                       # The list where we will store results.
linenum = 0
substr = "error".lower()          # Substring to search for.
with open('logfile.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if line.lower().find(substr) != -1:    # if case-insensitive match,
            errors.append("Line " + str(linenum) + ": " + line.rstrip('\n'))
for err in errors:
    print(err)
Input (stored in logfile.txt):
This is line 1
This is line 2
Line 3 has an error!
This is line 4
Line 5 also has an error!
Output:
Line 3: Line 3 has an error!
Line 5: Line 5 also has an error!
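A slightly more compact variant is sketched below. It assumes a hypothetical in-memory list, log_lines, standing in for the open file so the block runs without logfile.txt; iterating over a real file object works the same way. The built-in enumerate() supplies the line numbers, and the in operator replaces the find() != -1 test.

```python
# Stand-in for the lines of logfile.txt (an assumption for this sketch).
log_lines = [
    "This is line 1\n",
    "This is line 2\n",
    "Line 3 has an error!\n",
    "This is line 4\n",
    "Line 5 also has an error!\n",
]

substr = "error"
errors = []
for linenum, line in enumerate(log_lines, start=1):  # count lines from 1
    if substr in line.lower():                       # case-insensitive test
        errors.append("Line " + str(linenum) + ": " + line.rstrip('\n'))

for err in errors:
    print(err)
```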
Extract all lines containing substring, using regex
The program below is similar to the above program, but using the re regular expressions module. The errors and line numbers are stored as tuples, e.g., (linenum, line). The tuple is created by the additional enclosing parentheses in the errors.append() statement. The elements of the tuple are referenced similar to a list, with a zero-based index in brackets. As constructed here, err[0] is a linenum and err[1] is the associated line containing an error.
import re
errors = []
linenum = 0
pattern = re.compile("error", re.IGNORECASE)  # Compile a case-insensitive regex
with open('logfile.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if pattern.search(line) is not None:  # If a match is found
            errors.append((linenum, line.rstrip('\n')))
for err in errors:                            # Iterate over the list of tuples
    print("Line " + str(err[0]) + ": " + err[1])
Output:
Line 6: Mar 28 09:10:37 Error: cannot contact server. Connection refused.
Line 10: Mar 28 10:28:15 Kernel error: The specified location is not mounted.
Line 14: Mar 28 11:06:30 ERROR: usb 1-1: can't set config, exiting.
Extract all lines containing a phone number
The program below prints any line of a text file, info.txt, which contains a US or international phone number. It accomplishes this with the regular expression "(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}". This regex matches the following phone number notations:
- 123-456-7890
- (123) 456-7890
- 123 456 7890
- 123.456.7890
- +91 (123) 456-7890
import re
errors = []
linenum = 0
pattern = re.compile(r"(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}")
with open('info.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if pattern.search(line) is not None:  # If pattern search finds a match,
            errors.append((linenum, line.rstrip('\n')))
for err in errors:
    print("Line", str(err[0]), ": " + err[1])
Output:
Line 3 : My phone number is 731.215.8881.
Line 7 : You can reach Mr. Walters at (212) 558-3131.
Line 12 : His agent, Mrs. Kennedy, can be reached at +12 (123) 456-7890
Line 14 : She can also be contacted at (888) 312.8403, extension 12.
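Before running a pattern like this against a real file, it can help to try it on a handful of known strings. The sketch below checks the five notations listed earlier plus two made-up strings that should not match; the samples dictionary is illustrative only.

```python
import re

pattern = re.compile(r"(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}")

# Map each sample string to whether we expect search() to find a match.
samples = {
    "123-456-7890": True,
    "(123) 456-7890": True,
    "123 456 7890": True,
    "123.456.7890": True,
    "+91 (123) 456-7890": True,
    "no digits here": False,
    "12345": False,            # too few digits for \d{3} followed by \d{4}
}

results = {}
for sample in samples:
    results[sample] = pattern.search(sample) is not None
    print(sample, "->", results[sample])
```

Because search() looks for the pattern anywhere in the line, it is the final seven digits and their separators that trigger the match in these samples. That is enough to flag the line, but the pattern does not validate the area code.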
Search a dictionary for words
The program below searches the dictionary for any words that start with h and end in pe. For input, it uses a dictionary file included on many Unix systems, /usr/share/dict/words.
import re
filename = "/usr/share/dict/words"
pattern = re.compile(r"\bh\w*pe$", re.IGNORECASE)
with open(filename, "rt") as myfile:
    for line in myfile:
        if pattern.search(line) is not None:
            print(line, end='')
Output:
Hope
heliotrope
hope
hornpipe
horoscope
hype
Source: https://www.computerhope.com/issues/ch001721.htm