How does the String object work in Java?

I’m guessing the real answer is “no one that’s not employed at Sun knows for sure” but let’s brainstorm for a second if that’s the case.

Now, String isn’t a primitive, it’s a class, when you make a String it’s a “String object” unlike an int which is, well, just an int (hence the existence of wrapper classes like Integer). The only issue I have with this is that String seems to be able to do something no other object does, specifically that it seems to have a special declaration, going so far as being able to pass a String object as an argument without making an instance of it first, which is pretty much impossible otherwise (At least I’ve never seen a way to write a class that allows you to input data by, say, putting it between ampersands, i.e. AwesomeClass example = &This is so Cool, ****&, which could be interpreted however).

We speculated on this for a little bit before my lab section today, we thought maybe since “<stuff>” was such a common way of declaring a String that Sun just decided to make it be interpreted as a char array (which, at its most basic level, is what a String is with some fancy getter methods like substring). But none of the following work:


String whatever = new String[{'a', 'b', 'c'}];
String something = {'a','b','c'};
char[] test = {'a','b','c'};
String somethingElse = new String[test];
//Or conversely
char[] doesThisWork = "abc";


Or the other billion ways to write the same thing. It seems like String just “is” which confuses me a bit, if that’s the case why not just make str a primitive and have String as a wrapper class? The only reason I can think not to is primitives don’t have things like <whatever>.substring(a, b), but that could be worked around by adding something like java.util.Substring that takes (str, index1, index2).

Anyone have any insight?

If your question is just “What’s up with the ability to make constant strings just by placing text between quotes? What object instances are being referred to when this is done?”, you can think of what’s actually going on as this:

Whenever you write


whateverFunction("Hello World");

This is automatically interpreted by the compiler into something like



String temporary = new String(); //Makes an empty string
temporary.addChar('H'); //Stick an 'H' onto the end (addChar is made up for illustration; String has no such method)
temporary.addChar('e'); //Stick an 'e' onto the end
...
temporary.addChar('d');
whateverFunction(temporary);


Now, that may not be literally what’s going on, but it’s essentially the same. There actually is an object instance for each string, just sometimes created implicitly instead of explicitly.

I’m not exactly sure what you are asking. I’m pretty sure that String specifically has special support in the Java compiler, and quoted strings are just considered to become Strings. It’s supported at a lower level than anything else, even Object, because even Object depends on String for at least one of its methods (toString).

I may be confusing C# and Java here a bit, but I believe Java also interns its string constants, so that when you write a string constant with the same contents as one already created, it’s just a reference to the same interned object that the first one refers to.

It’s just a class with extra support in Java, because strings are so important and so widely used. I do think C# and Java should both just go ahead and make strings a native type, however. Until then, strings will be kinda in limbo, in a middle status between ints and Foos.

That’s kind of what I thought, it’s just supported at a lower level then, they didn’t make the String class entirely in Java like (from what I understand) most non-primitives are. (Also, what is a Foo exactly? I thought that was just a variable placeholder name and the internet agrees, are you just using it to mean “generic non-primitive datatype”?)

I realize my question was a little unclear. I guess what I was saying is almost akin to “what the hell is a String anyway?”, which I admit is a terrible question. I was partially asking what Indistinguishable explained, but I think I was more confused about why, if what Indistinguishable said is basically the case, it only accepts a String constant as an argument when it’s basically just making an array of chars in the first place. That seems like the sort of territory of a primitive. I guess “in limbo” is a fairly good description: just a specially supported class due to its importance, then (I still think it’s stupid, but at least I can say it’s stupid while being somewhat informed now).

It is a class with a special constructor. Furthermore, strings are immutable and are stored once created and pointed at by any variable that contains the same string (String.equals() is equivalent to string1 == string2). Arrays are another special type of object.

FWIW,
Rob

When you create a string constant in your code, you’re creating an anonymous object (a String instance that no variable names), much as you can create anonymous functions in Java. Note, for instance, that you can write code like:

int length = ("Hello").length();

I’m not sure if you can do the same thing with numeric constants without casting them to Integer or Double or whatever.

It’s been a while since I worked with Java, but I thought that this wasn’t the case.

String.equals() was what I WANTED for string1 == string2 - IE, do they have the same characters in the same order.

string1 == string2 was done as a test of pointer equality - are they references that point to the same memory location.

Sorry if I’m sidetracking.

Yeah I just meant Foo as in some class that you or the libraries declare that’s not special.

I think the legacy of C++ may have had something to do with the way it is in Java and C#. The accepted way to do strings in C++ was to forget about doing it the native way with an array of char, and instead use a library class, so maybe they didn’t even think about making string a native type because they didn’t want to run into all the trouble that C char-based strings had (having to keep track of length separately, not much support for changing size dynamically, and so on).

It would have been cumbersome to have to do:

char mycarray[] = {'H', 'E', 'L', 'L', 'O', ' ', 'W', 'O', 'R', 'L', 'D'};
String mystring = new String(mycarray, 11);
cout << mystring;

So they had to have some support and we are in the mess we are in.

It’s not the case. Strings are immutable, so if Java has the chance it will reuse existing String objects when you declare a new String. However, you can force it to allocate a new String if you want. In that case, the Strings will point to different locations, but contain the same values - .equals() will return true while == will return false.

If that’s not complicated enough, there’s a method on String, intern(), that will return the canonical representation of the String. If you call that on all your Strings you can use == to compare them. Nobody ever does this.

Sample code to make this clear below:



public class Main
{
    public static void main(String[] args)
    {
        String s1 = "a";
        String s2 = "a";
        String s3 = new String("a");  // Don't do this - you allocate an extra String on the heap

        System.out.println("s1 == s2 : " + (s1 == s2));
        System.out.println("s1 == s3 : " + (s1 == s3));
        System.out.println("s2 == s3 : " + (s2 == s3));
        System.out.println("s1.equals(s2) : " + (s1.equals(s2)));
        System.out.println("s2.equals(s3) : " + (s2.equals(s3)));

        s2 = s2.intern();
        s3 = s3.intern();
        System.out.println("Interned the strings...");
        System.out.println("s1 == s2 : " + (s1 == s2));
        System.out.println("s1 == s3 : " + (s1 == s3));
        System.out.println("s2 == s3 : " + (s2 == s3));
        System.out.println("s1.equals(s2) : " + (s1.equals(s2)));
        System.out.println("s2.equals(s3) : " + (s2.equals(s3)));
    }
}


Yields the output:
s1 == s2 : true
s1 == s3 : false
s2 == s3 : false
s1.equals(s2) : true
s2.equals(s3) : true
Interned the strings...
s1 == s2 : true
s1 == s3 : true
s2 == s3 : true
s1.equals(s2) : true
s2.equals(s3) : true

Sorry, I didn’t see your point about interning Strings. Yeah, Java does this, but it’s more of a memory savings than anything you can rely on. I believe the split is that any constant String is automatically interned (“foo”, or even “foo” + “bar” since the compiler optimizes it) but anything coming through a String constructor isn’t. Since you don’t know how a toString() generates its return value, you either wind up calling intern() a lot more than you need to, or you ignore it. Pity, really.

The “stuff between quotes” is just syntax. It’s nothing special, except that Java has so little syntax for creating object literals that it appears special; basically, the only other non-primitive type that has special syntax is the array.

In other words, it’s the parser/compiler that translates


"foo"

into something like


new String(new char[] {'f','o','o'})

, except that it probably uses interning etc to be more efficient.

That’s because a String in Java cannot be just a character array: Strings are immutable but arrays aren’t.
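For what it’s worth, the conversions the original post was reaching for do exist; they just go through a constructor and a method rather than an assignment. Here is a minimal sketch (the class and method names are made up for illustration) suggesting how a String stays immutable even though the array isn’t: it copies the characters on the way in and on the way out.

```java
import java.util.Arrays;

final class CharArrayDemo {
    // Builds a String from a char array, then mutates the array afterwards.
    static String buildThenMutate() {
        char[] chars = {'a', 'b', 'c'};
        String s = new String(chars); // the constructor copies the characters
        chars[0] = 'z';               // changes the array only, not the String
        return s;
    }

    // Goes the other way: toCharArray() hands back a fresh copy.
    static char[] copyOut(String s) {
        char[] copy = s.toCharArray();
        copy[0] = 'z';                // the String itself is untouched
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(buildThenMutate());               // abc
        System.out.println(Arrays.toString(copyOut("abc"))); // [z, b, c]
    }
}
```

So char[] doesThisWork = "abc".toCharArray(); compiles, while the plain assignment from the first post doesn’t.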

If you’re really interested in this kind of stuff, I suggest you try Common Lisp for a bit. It’s interesting in many ways, but for this subject, it’s very interesting because all the intricate stuff the reader (“parser”) and compiler do together is completely specified and most of it can even be modified by the programmer. It should give you a fairly good insight in how languages in general are implemented. This book is free online and gives a pretty good intro for people who are already programmers

Whoops, I forgot in my post above that Java Strings are immutable. Modify it accordingly.

So wait, I’m a little confused, when interning does it always do the whole String?

For example:


String a = "foo";
String b = "bar";
String c = "foo" + "bar";


Would this end up referencing interned locations (“foo” “bar” and “foobar”) or would String c’s reference then point to the interned location of both “foo” and “bar”?

In other words
a is (I know this isn’t quite the way it outputs, but I’m trying to be brief) String;@19821f
b is String;@198220
c is String;@198221

Where 19821f will end up pointing to “foo” in the intern pool, 198220 ends up referring to “bar” in the intern pool, does 198221 contain TWO references (to “foo” and “bar”) or to a different spot in the pool called “foobar”?

(I did not know about interning until now, thanks for the explanations)

Probably. Interning is mostly useful when doing comparisons. It’s useful to be able to quickly do an equals() on many “instances” of strings out of a limited set, so Java tries to minimize the number of distinct strings that are equals(). This strategy can also reduce the amount of memory needed, but AFAIK in practice it doesn’t really help much for that, so I suspect Java doesn’t try to “pack” substrings together internally.

Some languages have a distinct Symbol type for this precise problem. Symbols are always interned, and global (every symbol of a given name is equal - actually identical - it’s the same object - to any other symbol of the same name anywhere in any namespace/class) so equals() operations are extremely quick, but that also means using Symbols for strings that aren’t reused or compared gives you a performance hit - in fact, the string representation of a symbol is pretty much only useful for serializing and debugging. Java in some ways tries to handle both issues with Strings (and numeric CONSTANTS), but it’s not an ideal situation.

Interning is a very low-level optimization thing which you might as well ignore and forget you ever heard about unless you really have reason to care; from the perspective of the typical Java programmer, it’s not something you should assume happens. But, to answer your question, String c’s reference will point to an entirely different object than that of a and b, since its contents are entirely different. Indeed, the + operation will create a new instance of String (whose contents are given by concatenation) and return a reference to it. The only case where two references can ever end up being the same is where they both point to the exact same data.

[Nor, I should make clear, is it something you should assume doesn’t happen. You should generally shy away from making assumptions about what the rules of this phenomenon are…]

Java doesn’t have the concept of a string made up of substrings, so a, b, and c all point to different memory locations.

In this particular case, String c gets optimized by the compiler, and it generates the same bytecode as String c = "foobar"; If instead it was String c = a + b; it would generate bytecode to do the concatenation (I’m pretty sure - may have changed in recent versions) but it would still be a different String. By the way, the other big place that String breaks the rules has been staring at you, but you’re too used to it to notice - String is the only(?) class that supports operator overloading in Java. You can’t write Clazz c = a + b; where a and b are both Clazz.
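If you want to see the difference for yourself, here is a small sketch (class and method names are just made up for the example). It assumes a and b are ordinary non-final local variables, which is what keeps a + b from being a compile-time constant.

```java
final class ConcatDemo {
    // Both operands are literals, so the compiler folds this into "foobar".
    static boolean constantConcatIsSameObject() {
        String c = "foo" + "bar";
        return c == "foobar";
    }

    // a and b are non-final variables, so the concatenation happens at run time
    // and produces a new String object.
    static boolean runtimeConcatIsSameObject() {
        String a = "foo";
        String b = "bar";
        String c = a + b;
        return c == "foobar";
    }

    // intern() maps the run-time result back to the pooled "foobar".
    static boolean internedConcatIsSameObject() {
        String a = "foo";
        String b = "bar";
        return (a + b).intern() == "foobar";
    }

    public static void main(String[] args) {
        System.out.println(constantConcatIsSameObject());  // true
        System.out.println(runtimeConcatIsSameObject());   // false
        System.out.println(internedConcatIsSameObject());  // true
    }
}
```

Either way, c is its own “foobar” object; it never holds references to the separate “foo” and “bar” objects.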

Wise words from a wise man. Even if the code you write works with the interning rules, I guarantee the next person to maintain your code will curse your name unto the tenth generation.

Anyone want to talk about weak references and memory leaks?

Not in Java, I don’t. :)

Yes, well, it’s not exactly difficult to make programmers curse your name unto the tenth generation, reminds me of the… colorful… debate about linked priority lists I was lucky enough to be in the room for (“you should swap by data stored, it’s so much easier and the code is so clean!” “But it’s just WRONG, you should swap the references!” and so on for about an hour)

You can use numeric constants where an Integer or Double is required and vice versa (but only in limited situations like inserting to or retrieving from a collection). This is called autoboxing and unboxing.
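A rough sketch of what that looks like, assuming Java 7 or later for the <> shorthand (the class and method names are invented for the example). It also covers the earlier question about calling methods on numeric constants: you have to box the number explicitly first.

```java
import java.util.ArrayList;
import java.util.List;

final class BoxingDemo {
    // Stores an int constant in a List<Integer> and reads it back as an int.
    static int roundTrip() {
        List<Integer> numbers = new ArrayList<>();
        numbers.add(42);         // autoboxing: int -> Integer
        int n = numbers.get(0);  // unboxing: Integer -> int
        return n;
    }

    // A String literal is already an object, so methods can be called on it.
    static int literalLength() {
        return "Hello".length();
    }

    // 42.toString() does not compile; box explicitly to call a method.
    static String boxedToString() {
        return Integer.valueOf(42).toString();
    }

    public static void main(String[] args) {
        System.out.println(roundTrip());      // 42
        System.out.println(literalLength());  // 5
        System.out.println(boxedToString());  // 42
    }
}
```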